Perplexity AI is complaining their plagiarism bot machine cannot bypass Cloudflare's firewall

⁨0⁩ ⁨likes⁩

Submitted 7 months ago by Davriellelouna@lemmy.world to technology@lemmy.world

https://www.searchenginejournal.com/perplexity-says-cloudflare-is-blocking-legitimate-ai-assistants/552927/

source

Comments

  • tarknassus@lemmy.world ⁨7⁩ ⁨months⁩ ago

    I don’t see a problem here. Maybe Perplexity should consider the reasons WHY Cloudflare have a firewall…?

    source
  • Jimmycrackcrack@lemmy.ml ⁨7⁩ ⁨months⁩ ago

    Gee that’s a real removed it ain’t it perplexity?

    source
  • Electricd@lemmybefree.net ⁨7⁩ ⁨months⁩ ago

    They do have a point though. It would be great to let per-prompt searches go through, but not mass scraping.

    source
    • threeganzi@sh.itjust.works ⁨7⁩ ⁨months⁩ ago

      Does it not need to be scraped to be indexed, assuming it’s semi-typical RAG stuff?

      source
      • Electricd@lemmybefree.net ⁨7⁩ ⁨months⁩ ago

        I assume their script does some search engine stuff, like querying Google or Bing, and then “scrapes” the links it lands on.

        Some selenium stuff

        source
  • Electricd@lemmybefree.net ⁨7⁩ ⁨months⁩ ago

    I don’t like Cloudflare, but it’s nice that they allow people to stop AI scraping if they want to.

    source
    • tempest@lemmy.ca ⁨7⁩ ⁨months⁩ ago

      CloudFlare has become an Internet protection racket and I’m not happy about it.

      source
      • Electricd@lemmybefree.net ⁨7⁩ ⁨months⁩ ago

        they’re good at it but damn, having a company being MITM feels so wrong

        source
      • Laser@feddit.org ⁨7⁩ ⁨months⁩ ago

        It’s been this from the very beginning. But they don’t fit the definition of a protection racket as they’re not the ones attacking you if you don’t pay up. So they’re more like a security company that has no competitors due to the needed investment to operate.

        source
  • drmoose@lemmy.world ⁨7⁩ ⁨months⁩ ago

    It’s insane that anyone would side with Cloudflare here. To this day I can’t visit many websites like nexusmods just because I run Firefox on Linux. The Cloudflare turnstile just refreshes infinitely and has been doing so for months now.

    Cloudflare is the biggest cancer on the web, fucking burn it.

    source
    • CatDogL0ver@lemmy.world ⁨7⁩ ⁨months⁩ ago

      It happened to me before, until I did a Google search. It was my VPN’s web protection. It was too “overprotective”.

      Check your security settings, antivirus and VPN

      source
    • Dremor@lemmy.world ⁨7⁩ ⁨months⁩ ago

      Linux and Firefox here. No problem at all with Cloudflare, despite having more or less as many privacy-preserving add-ons as possible. I even spoof my user agent to the latest Firefox ESR on Linux.

      Something must be wrong with your setup.

      source
      • COASTER1921@lemmy.ml ⁨7⁩ ⁨months⁩ ago

        I suspect a lot of it comes down to your ISP. Like the original commenter, I also frequently can’t pass the CloudFlare turnstile when on Wi-Fi, although refreshing the page a few times usually gets me through. Worst case, I can pass much more consistently on my phone’s hotspot. It’s super annoying and, combined with their recent DNS outage, has totally ruined any respect I had for CloudFlare.

        Interesting video on the subject: youtu.be/SasXJwyKkMI

        source
      • drmoose@lemmy.world ⁨7⁩ ⁨months⁩ ago

        That’s not how it works. CF uses thousands of variables to estimate a trust score and block people, so just because it works for you doesn’t mean it works for everyone.

        source
    • dodos@lemmy.world ⁨7⁩ ⁨months⁩ ago

      I’m on Linux with Firefox and have never had that issue before (particularly nexusmods which I use regularly). Something else is probably wrong with your setup.

      source
      • jaemo@sh.itjust.works ⁨7⁩ ⁨months⁩ ago

        Thirded. All three (Linux, FF, nexus)

        ZERO ISSUES.

        source
      • drmoose@lemmy.world ⁨7⁩ ⁨months⁩ ago

        “Wrong with my setup” - that’s not how the internet works.

        I’m based in Southeast Asia and often work on the road, so IP rating probably is the final crutch in my fingerprint score.

        Either way, this should in no way be acceptable.

        source
      • Yeller_king@reddthat.com ⁨7⁩ ⁨months⁩ ago

        In my case, it’s usually the VPN.

        source
    • baronofclubs@lemmy.world ⁨7⁩ ⁨months⁩ ago

      omg ur a hacker

      Did you mean Edge on Windows? 'Cause if so, welcome in!

      source
  • Amberskin@europe.pub ⁨7⁩ ⁨months⁩ ago

    Uh, are they admitting they are trying to circumvent technological protections set up to restrict access to a system?

    Isn’t that a literal computer crime?

    source
    • dinckelman@lemmy.world ⁨7⁩ ⁨months⁩ ago

      No-no, see. When an AI-first company does it, it’s actually called courageous innovation. Crimes are for poor people

      source
      • silicon@lemmy.world ⁨7⁩ ⁨months⁩ ago

        See: Facebook/Meta

        source
    • utopiah@lemmy.world ⁨7⁩ ⁨months⁩ ago

      puts on evil hat CloudFlare should DRM their protection then DMCA Perplexity and other US based “AI” companies to oblivion. Side effect, might break the Internet.

      source
      • iamdefinitelyoverthirteen@lemmy.world ⁨7⁩ ⁨months⁩ ago

        The Internet was already ruined; Cloudflare is just band-aids on top of band-aids.

        source
      • Deflated0ne@lemmy.world ⁨7⁩ ⁨months⁩ ago

        Worth it.

        source
  • poopkins@lemmy.world ⁨7⁩ ⁨months⁩ ago

    I’ve developed my own agent for assisting me with researching a topic I’m passionate about, and I ran into the exact same barrier: Cloudflare intercepts my request and is clearly checking if I’m a human using a web browser.

    So I use that as a signal that the website doesn’t want automated tools scraping their data. That’s fine with me: my agent just tells me that there might be interesting content on the site and gives me a deep link. I can extract the data and carry on my research on my own.

    source
    • IphtashuFitz@lemmy.world ⁨7⁩ ⁨months⁩ ago

      I hate to break it to you but not only does Cloudflare do this sort of thing, but so does Akamai, AWS, and virtually every other CDN provider out there. And far from being awful, it’s actually protecting the web.

      We use Akamai where I work, and they inform us in real time when a request comes from a bot, and they further classify it as one of a dozen or so bots (search engine crawlers, analytics bots, advertising bots, social networks, AI bots, etc). It also informs us if it’s somebody impersonating a well known bot like Google, etc. So we can easily allow search engines to crawl our site while blocking AI bots, bots impersonating Google, and so on.

      source
      • poopkins@lemmy.world ⁨7⁩ ⁨months⁩ ago

        By “things like this are awful for the web,” I meant that automation through AI is awful for the web. It takes away from the original content creators without any attribution and hits their bottom line.

        My story was supposed to be one about responsible AI, but somehow I screwed that up in my summary.

        source
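
The "treat the challenge as a signal and hand the link back to the human" approach described in this sub-thread can be sketched roughly as below. This is only an illustration under stated assumptions: the `FetchResult` type and both function names are made up for the example, and the check relies on the `cf-mitigated: challenge` response header plus a cruder status-code fallback, which is a heuristic rather than a guaranteed way to recognise every challenge page.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class FetchResult:
    """Hypothetical result type for this sketch."""
    url: str
    text: Optional[str]   # page body, if the site served it to us
    needs_human: bool     # True when the site appears to want a real browser


def looks_like_challenge(status: int, headers: Mapping[str, str]) -> bool:
    """Heuristic guess that a response is a bot challenge, not real content."""
    lowered = {k.lower(): v.lower() for k, v in headers.items()}
    # Assumption: a challenged response carries "cf-mitigated: challenge".
    if lowered.get("cf-mitigated") == "challenge":
        return True
    # Cruder fallback: a blocking status code served by Cloudflare.
    return status in (403, 429, 503) and lowered.get("server") == "cloudflare"


def handle_response(url: str, status: int,
                    headers: Mapping[str, str], body: str) -> FetchResult:
    """Back off instead of trying to get around the challenge."""
    if looks_like_challenge(status, headers):
        # Do not retry or disguise the client; just report a deep link.
        return FetchResult(url=url, text=None, needs_human=True)
    return FetchResult(url=url, text=body, needs_human=False)
```

An agent built this way would show `result.url` to the user whenever `needs_human` is true and only process `result.text` otherwise.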
  • kreskin@lemmy.world ⁨7⁩ ⁨months⁩ ago

    They can’t get their AI to check a box that says “I am not a robot”? I’d think that’d be a first-year comp sci student level task.

    source
    • 5gruel@lemmy.world ⁨7⁩ ⁨months⁩ ago

      Recaptcha v2 does way more than check if the box was checked.

      stackoverflow.com/a/27299487

      source
      • kreskin@lemmy.world ⁨7⁩ ⁨months⁩ ago

        You’re not wrong, but it also lets more than 99.8% of bot traffic through on text challenges. It’s like the TSA of website security. It’s mostly there to keep the user busy while Cloudflare places itself as a man in the middle of your encrypted connection to a third party. The only difference between Cloudflare and a malicious attacker is Cloudflare’s stated intention not to be evil. With that and 3 dollars I can buy myself a single hard shell taco from Taco Bell.

        source
    • drmoose@lemmy.world ⁨7⁩ ⁨months⁩ ago

      Cloudflare actually fully fingerprints your browser and even sells that data. That’s your IP, TLS, operating system, full browser environment, installed extensions, GPU capabilities, etc. It’s all tracked before the box even shows up; in fact, the box is there to give the runtime more time to fingerprint you.

      source
      • tempest@lemmy.ca ⁨7⁩ ⁨months⁩ ago

        Yeah and the worst part is it doesn’t fucking work for the one thing it’s supposed to do.

        The only thing it does is stop the stupidest low effort scrapers and forces the good ones to use a browser.

        source
  • TheGrandNagus@lemmy.world ⁨7⁩ ⁨months⁩ ago

    Can’t believe I’ve lived to see Cloudflare be the good guys

    source
    • DreamlandLividity@lemmy.world ⁨7⁩ ⁨months⁩ ago

      Lesser of two bad guys maybe?

      source
    • sunbeam60@lemmy.ml ⁨7⁩ ⁨months⁩ ago

      They’re not. They’re using this as an excuse to become paid gatekeepers of the internet as we know it. All that’s happening is that Cloudflare is using this to maneuver into a position where they can say “nice traffic you’ve got there - would be a shame if something happened to it”.

      AI companies are crap.

      What Cloudflare is doing here is also crap.

      And we’re cheering it on.

      source
  • Wispy2891@lemmy.world ⁨7⁩ ⁨months⁩ ago

    Here comes the ridiculous offer to buy Google Chrome with money they don’t have: easy scraping directly from the user source

    source
  • kittenzrulz123@lemmy.blahaj.zone ⁨7⁩ ⁨months⁩ ago

    Image

    source
  • tibi@lemmy.world ⁨7⁩ ⁨months⁩ ago

    You could say they are… Perplexed.

    source
  • Kissaki@feddit.org ⁨7⁩ ⁨months⁩ ago

    So, I assume Perplexity uses appropriate identifiable user-agent headers, to allow hosters to decide whether to serve them one way or another?

    source
    • ubergeek@lemmy.today ⁨7⁩ ⁨months⁩ ago

      And I’m assuming if the robots.txt states their UserAgent isn’t allowed to crawl, it obeys it, right? :P

      source
      • Kissaki@feddit.org ⁨7⁩ ⁨months⁩ ago

        No, as per the article, their argumentation is that they are not web crawlers generating an index, they are user-action-triggered agents working live for the user.

        source
    • drmoose@lemmy.world ⁨7⁩ ⁨months⁩ ago

      It’s not up to the hoster to decide whom to serve content to. The web is intended to be user-agent agnostic.

      source
    • lime@feddit.nu ⁨7⁩ ⁨months⁩ ago

      yeah it’s almost like there is already a system for this in place

      source
      • seraphine@lemmy.blahaj.zone ⁨7⁩ ⁨months⁩ ago

        THE CAKE DAY IS NOW. (i dont have an image at hand)

        source
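
For anyone unfamiliar with the existing system this sub-thread keeps alluding to (an identifiable user agent plus robots.txt), here is a minimal sketch of how a well-behaved client would check a site's rules before fetching. It uses Python's standard `urllib.robotparser`; the bot name `ExampleAIBot`, the rules, and the URL are invented for illustration and are not Perplexity's real user agent or any real site's policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named AI bot, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = is_allowed(ROBOTS_TXT, "ExampleAIBot", "https://example.com/page")
browser = is_allowed(ROBOTS_TXT, "Mozilla/5.0", "https://example.com/page")
print(blocked, browser)  # False True
```

Note that robots.txt is purely advisory: it only works if the client reports its user agent honestly and chooses to obey the answer, which is the point being argued over here.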
  • NotASharkInAManSuit@lemmy.world ⁨7⁩ ⁨months⁩ ago

    That’s the entire point, dipshit. I wish we got one of the cool techno dystopias rather than this boring corporate idiot one.

    source
    • Dojan@pawb.social ⁨7⁩ ⁨months⁩ ago

      I’m still holding out for Stephen Hawking to mail out Demon Summoning programs.

      source
  • WolfLink@sh.itjust.works ⁨7⁩ ⁨months⁩ ago

    This is a nice CloudFlare ad

    source
    • pyre@lemmy.world ⁨7⁩ ⁨months⁩ ago

      yeah. still not worth dealing with fucking cloudflare. fuck cloudflare.

      source
      • oppy1984@lemdro.id ⁨7⁩ ⁨months⁩ ago

        I’m out of the loop, what’s wrong with cloud flare?

        source
      • int32@lemmy.dbzer0.com ⁨7⁩ ⁨months⁩ ago

        DEATH TO CLOUDFLARE!

        source
  • LodeMike@lemmy.today ⁨7⁩ ⁨months⁩ ago

    Words cannot describe how much I hate this person

    source
  • kokesh@lemmy.world ⁨7⁩ ⁨months⁩ ago

    Is there some simply deployable PHP honeytrap for AI crawlers?

    source
    • ubergeek@lemmy.today ⁨7⁩ ⁨months⁩ ago

      You could probably route all requests to your site from them back at themselves, so they DDoS themselves, and on top of it, cost them more because their endpoint needs to process things via their LLM.

      source
    • blargh513@sh.itjust.works ⁨7⁩ ⁨months⁩ ago

      I used to make tarpits with reverse proxies. Accept the connection, then hold the response until a few seconds before the default TCP timeout. It doesn’t eat many resources as long as you have enough TCP connections and can reuse them effectively.

      source
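
The question asked about PHP, but the tarpit idea above is language-independent, so here is a rough sketch of it in Python with `asyncio`. Everything specific is an assumption for illustration: the 25-second delay, the port, and the function names are made up, and a real deployment would sit behind a reverse proxy rule that routes only suspected crawlers here.

```python
import asyncio


async def tarpit_handler(reader, writer, delay_seconds: float = 25.0):
    """Accept a connection, stall, then send a minimal empty response."""
    try:
        # Sleeping ties up the client's connection slot while costing us
        # almost nothing: no thread, just a pending timer.
        await asyncio.sleep(delay_seconds)
        writer.write(
            b"HTTP/1.1 200 OK\r\n"
            b"Content-Length: 0\r\n"
            b"Connection: close\r\n\r\n"
        )
        await writer.drain()
    finally:
        writer.close()
        await writer.wait_closed()


async def serve(host: str = "127.0.0.1", port: int = 8080,
                delay_seconds: float = 25.0):
    """Run the tarpit until cancelled (host and port are placeholders)."""
    async def handler(reader, writer):
        await tarpit_handler(reader, writer, delay_seconds)

    server = await asyncio.start_server(handler, host, port)
    async with server:
        await server.serve_forever()
```

Starting it is a matter of calling `asyncio.run(serve())`. The delay should be tuned to stay just under whatever timeout the crawler actually uses, which has to be found by observation.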
  • frezik@lemmy.blahaj.zone ⁨7⁩ ⁨months⁩ ago

    Traveling snake oil salesman complains he can’t pick people’s locks.

    source
  • gravitas_deficiency@sh.itjust.works ⁨7⁩ ⁨months⁩ ago

    good, that means it’s working

    I’m gonna be frustrated (though not surprised) if the response is anything other than this.

    source
  • EtherWhack@lemmy.world ⁨7⁩ ⁨months⁩ ago

    Image

    source
  • fossilesque@mander.xyz ⁨7⁩ ⁨months⁩ ago

    I hate that these bots ruin my read it later app. :(

    source
  • ubergeek@lemmy.today ⁨7⁩ ⁨months⁩ ago

    Good. I went through my CF panel, and blocked some of those “AI Assistants” that by default were open, including Perplexity’s.

    source
    • _g_be@lemmy.world ⁨7⁩ ⁨months⁩ ago

      CF panel? You’re light bulb??

      source
      • ubergeek@lemmy.today ⁨7⁩ ⁨months⁩ ago

        CF == Cloudflare :)

        source
  • peoplebeproblems@midwest.social ⁨7⁩ ⁨months⁩ ago

    Well… Good.

    source
  • wosat@lemmy.world ⁨7⁩ ⁨months⁩ ago

    This is why companies like Perplexity and OpenAI are creating browsers.

    source
  • josefo@leminal.space ⁨7⁩ ⁨months⁩ ago

    I really hope Cloudflare doesn’t eventually evolve into a shitty ass company, so far I like them very much, and all this massive L for AI only improves my opinion on them.

    source
  • starchylemming@lemmy.world ⁨7⁩ ⁨months⁩ ago

    next step: cloudflare sends hit squads to blow up the source of these slimy data grabber attacks

    source
  • GissaMittJobb@lemmy.ml ⁨7⁩ ⁨months⁩ ago

    Skill issue. Cope and seethe

    source
    • sol6_vi@lemmy.makearmy.io ⁨7⁩ ⁨months⁩ ago

      this made me lol

      source
  • cupcakezealot@piefed.blahaj.zone ⁨7⁩ ⁨months⁩ ago

    rare cloudflare w

    source
    • boonhet@sopuli.xyz ⁨7⁩ ⁨months⁩ ago

      As far as security is concerned, their w’s are pretty common tbh. It’s just the whole centralization issue.

      source