Perplexity AI is complaining their plagiarism bot machine cannot bypass Cloudflare's firewall

⁨876⁩ ⁨likes⁩

Submitted ⁨⁨3⁩ ⁨weeks⁩ ago⁩ by ⁨Davriellelouna@lemmy.world⁩ to ⁨technology@lemmy.world⁩

https://www.searchenginejournal.com/perplexity-says-cloudflare-is-blocking-legitimate-ai-assistants/552927/

source

Comments

  • Glitchvid@lemmy.world ⁨3⁩ ⁨weeks⁩ ago

    When a firm outright admits to bypassing or trying to bypass measures taken to keep them out, you’d think that would be a slam dunk case of unauthorized access under the CFAA with felony enhancements.

    source
    • GamingChairModel@lemmy.world ⁨3⁩ ⁨weeks⁩ ago

      Fuck that. I don’t need prosecutors and the courts to rule that accessing publicly available information in a way that the website owner doesn’t want is literally a crime. That logic would extend to ad blockers and editing HTML/js in an “inspect element” tag.

      source
      • EncryptKeeper@lemmy.world ⁨3⁩ ⁨weeks⁩ ago

        That logic would not extend to ad blockers, as the point of concern is gaining unauthorized access to a computer system or asset. Blocking ads would not be considered gaining unauthorized access to anything. In fact it would be the opposite of that.

        source
      • kibiz0r@midwest.social ⁨2⁩ ⁨weeks⁩ ago

        They already prosecute people under the unauthorized access provision. They just don’t prosecute rich people under it.

        source
    • jve@lemmy.world ⁨2⁩ ⁨weeks⁩ ago

      Right? Isn’t this a textbook DMCA violation?

      source
      • WhyJiffie@sh.itjust.works ⁨2⁩ ⁨weeks⁩ ago

        for us, not for them. wait until they argue in court that actually it’s us at fault and we need to provide access or else

        source
  • floquant@lemmy.dbzer0.com ⁨3⁩ ⁨weeks⁩ ago

    It’s difficult to be a shittier company than OpenAI, but Perplexity seems to be trying hard.

    source
    • Brunbrun6766@lemmy.world ⁨3⁩ ⁨weeks⁩ ago

      Step 1, SOMEHOW find a more punchable face than Altman

      source
      • Tollana1234567@lemmy.today ⁨2⁩ ⁨weeks⁩ ago

        put META android zuckerberg on.

        source
      • scarabic@lemmy.world ⁨2⁩ ⁨weeks⁩ ago

        Altman’s face looks like it’s already been punched

        source
  • WolfLink@sh.itjust.works ⁨2⁩ ⁨weeks⁩ ago

    This is a nice CloudFlare ad

    source
    • pyre@lemmy.world ⁨2⁩ ⁨weeks⁩ ago

      yeah. still not worth dealing with fucking cloudflare. fuck cloudflare.

      source
      • int32@lemmy.dbzer0.com ⁨2⁩ ⁨weeks⁩ ago

        DEATH TO CLOUDFLARE!

        source
      • oppy1984@lemdro.id ⁨2⁩ ⁨weeks⁩ ago

        I’m out of the loop, what’s wrong with cloud flare?

        source
  • Kissaki@feddit.org ⁨2⁩ ⁨weeks⁩ ago

    So, I assume Perplexity uses appropriate identifiable user-agent headers, to allow hosters to decide whether to serve them one way or another?

    source
    • lime@feddit.nu ⁨2⁩ ⁨weeks⁩ ago

      yeah it’s almost like there was already a system for this in place

      source
      • seraphine@lemmy.blahaj.zone ⁨2⁩ ⁨weeks⁩ ago

        THE CAKE DAY IS NOW. (i dont have an image at hand)

        source
    • ubergeek@lemmy.today ⁨2⁩ ⁨weeks⁩ ago

      And I’m assuming if the robots.txt state their UserAgent isn’t allowed to crawl, it obeys it, right? :P

      source
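
For reference, the robots.txt check mentioned above can be sketched with Python's standard `urllib.robotparser`. This is only a minimal illustration: the robots.txt content and the agent name `ExampleBot` are made up for the example, and nothing here reflects how Perplexity or Cloudflare actually behave.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is barred from the whole site,
# everyone else is only barred from /private/.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/page"))
    print(is_allowed(ROBOTS_TXT, "SomeBrowser", "https://example.com/page"))
```

Note that this check is voluntary on the client side: robots.txt only works when the client identifies itself honestly and chooses to run a check like this before fetching.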
      • Kissaki@feddit.org ⁨2⁩ ⁨weeks⁩ ago

        No, as per the article, their argumentation is that they are not web crawlers generating an index, they are user-action-triggered agents working live for the user.

        source
    • drmoose@lemmy.world ⁨2⁩ ⁨weeks⁩ ago

      It’s not up to the hoster to decide whom to serve content to. The web is intended to be user-agent agnostic.

      source
  • JeeBaiChow@lemmy.world ⁨3⁩ ⁨weeks⁩ ago

    Uh… good?

    source
  • Amberskin@europe.pub ⁨2⁩ ⁨weeks⁩ ago

    Uh, are they admitting they are trying to circumvent technological protections set up to restrict access to a system?

    Isn’t that a literal computer crime?

    source
    • dinckelman@lemmy.world ⁨2⁩ ⁨weeks⁩ ago

      No-no, see. When an AI-first company does it, it’s actually called courageous innovation. Crimes are for poor people

      source
      • silicon@lemmy.world ⁨2⁩ ⁨weeks⁩ ago

        See: Facebook/Meta

        source
    • utopiah@lemmy.world ⁨2⁩ ⁨weeks⁩ ago

      puts on evil hat CloudFlare should DRM their protection then DMCA Perplexity and other US based “AI” companies to oblivion. Side effect, might break the Internet.

      source
      • Deflated0ne@lemmy.world ⁨2⁩ ⁨weeks⁩ ago

        Worth it.

        source
      • iamdefinitelyoverthirteen@lemmy.world ⁨2⁩ ⁨weeks⁩ ago

        The Internet was already ruined, cloudflare is just bandaids on top of band aids.

        source
  • frezik@lemmy.blahaj.zone ⁨2⁩ ⁨weeks⁩ ago

    Traveling snake oil salesman complains he can’t pick people’s locks.

    source
  • tibi@lemmy.world ⁨2⁩ ⁨weeks⁩ ago

    You could say they are… Perplexed.

    source
  • cupcakezealot@piefed.blahaj.zone ⁨3⁩ ⁨weeks⁩ ago

    rare cloudflare w

    source
    • boonhet@sopuli.xyz ⁨2⁩ ⁨weeks⁩ ago

      As far as security is concerned, their w’s are pretty common tbh. It’s just the whole centralization issue.

      source
  • NotASharkInAManSuit@lemmy.world ⁨2⁩ ⁨weeks⁩ ago

    That’s the entire point, dipshit. I wish we got one of the cool techno dystopias rather than this boring corporate idiot one.

    source
    • Dojan@pawb.social ⁨2⁩ ⁨weeks⁩ ago

      I’m still holding out for Stephen Hawking to mail out Demon Summoning programs.

      source
  • EtherWhack@lemmy.world ⁨2⁩ ⁨weeks⁩ ago

    Image

    source
  • sylver_dragon@lemmy.world ⁨3⁩ ⁨weeks⁩ ago

    You’d think that a competent technology company, with their own AI would be able to figure out a way to spoof Cloudflare’s checks. I’d still think that.

    source
    • spankmonkey@lemmy.world ⁨3⁩ ⁨weeks⁩ ago

      Or find a more efficient way to manage data, since their current approach is basically DDOSing the internet for training data and for responding to user interactions.

      source
      • flux@lemmy.ml ⁨2⁩ ⁨weeks⁩ ago

        This is not about training data, though.

        Perplexity argues that Cloudflare is mischaracterizing AI Assistants as web crawlers, saying that they should not be subject to the same restrictions since they are user-initiated assistants.

        Personally I think that claim is a decent one: user-initiated requests should not be subject to robot limitations, and are not the source of DDoS attacks on websites.

        I think the solution is quite clear, though: either make use of the user’s identity to waltz through the blocks, or even make use of the user’s browser to do it. Once a captcha appears, let the user solve it.

        Though technically making all this happen flawlessly is quite a big task.

        source
    • Quill7513@slrpnk.net ⁨3⁩ ⁨weeks⁩ ago

      see, but they’re not competent. further, they don’t care. most of these ai companies are snake oil. they’re selling you a solution that doesn’t meaningfully solve a problem. their main way of surviving is saying “this is what it can do now, just imagine what it can do if you invest money in my company.”

      they’re scammers, the lot of them, running ponzi schemes with our money. if the planet dies for it, that’s no concern of theirs. ponzi schemes require the schemer to have no long term plan, just a line of credit that they can keep drawing from until they skip town before the tax collector comes

      source
    • lemmyng@piefed.ca ⁨3⁩ ⁨weeks⁩ ago

      Perplexity: "But that would cost us moneeyyyy!"

      source
  • ubergeek@lemmy.today ⁨2⁩ ⁨weeks⁩ ago

    Good. I went through my CF panel, and blocked some of those “AI Assistants” that by default were open, including Perplexity’s.

    source
    • _g_be@lemmy.world ⁨2⁩ ⁨weeks⁩ ago

      CF panel? You’re light bulb??

      source
      • ubergeek@lemmy.today ⁨2⁩ ⁨weeks⁩ ago

        CF == Cloudflare :)

        source
  • Electricd@lemmybefree.net ⁨2⁩ ⁨weeks⁩ ago

    I don’t like cloudflare but it’s nice that they allow people to stop AI scraping if they want to

    source
    • tempest@lemmy.ca ⁨2⁩ ⁨weeks⁩ ago

      CloudFlare has become an Internet protection racket and I’m not happy about it.

      source
      • Laser@feddit.org ⁨2⁩ ⁨weeks⁩ ago

        It’s been this from the very beginning. But they don’t fit the definition of a protection racket as they’re not the ones attacking you if you don’t pay up. So they’re more like a security company that has no competitors due to the needed investment to operate.

        source
      • Electricd@lemmybefree.net ⁨2⁩ ⁨weeks⁩ ago

        they’re good at it but damn, having a company being MITM feels so wrong

        source
  • TheGrandNagus@lemmy.world ⁨2⁩ ⁨weeks⁩ ago

    Can’t believe I’ve lived to see Cloudflare be the good guys

    source
    • sunbeam60@lemmy.ml ⁨2⁩ ⁨weeks⁩ ago

      They’re not. They’re using this as an excuse to become paid gatekeepers of the internet as we know it. All that’s happening is that Cloudflare is using this to maneuver into a position where they can say “nice traffic you’ve got there - would be a shame if something happened to it”.

      AI companies are crap.

      What Cloudflare is doing here is also crap.

      And we’re cheering it on.

      source
    • DreamlandLividity@lemmy.world ⁨2⁩ ⁨weeks⁩ ago

      Lesser of two bad guys maybe?

      source
  • iAvicenna@lemmy.world ⁨3⁩ ⁨weeks⁩ ago

    ask AI how to do it?

    source
  • peoplebeproblems@midwest.social ⁨2⁩ ⁨weeks⁩ ago

    Well… Good.

    source
  • wosat@lemmy.world ⁨2⁩ ⁨weeks⁩ ago

    This is why companies like Perplexity and OpenAI are creating browsers.

    source
  • gravitas_deficiency@sh.itjust.works ⁨2⁩ ⁨weeks⁩ ago

    good, that means it’s working

    I’m gonna be frustrated (though not surprised) if the response is anything other than this.

    source
  • kittenzrulz123@lemmy.blahaj.zone ⁨2⁩ ⁨weeks⁩ ago

    Image

    source
  • GissaMittJobb@lemmy.ml ⁨3⁩ ⁨weeks⁩ ago

    Skill issue. Cope and seethe

    source
    • sol6_vi@lemmy.makearmy.io ⁨2⁩ ⁨weeks⁩ ago

      this made me lol

      source
  • kescusay@lemmy.world ⁨3⁩ ⁨weeks⁩ ago

    I set up a WAF for my company’s publicly facing developer portal to block out bot traffic from assholes like these guys. It reduced bot traffic to the site by something like - I kid you not - 99.999%.

    Fucking data vultures.

    source
  • Ermiar@lemmy.world ⁨3⁩ ⁨weeks⁩ ago

    Image

    source
  • Ekybio@lemmy.world ⁨3⁩ ⁨weeks⁩ ago

    Can someone with more knowledge shine a bit more light on this whole situation? I’m out of the loop on the technical details

    source
    • spankmonkey@lemmy.world ⁨3⁩ ⁨weeks⁩ ago

      AI crawlers tend to overwhelm websites by doing the least efficient scraping of data possible, basically DDOSing a huge portion of the internet. Perplexity already scraped the net for training data and is now hammering it inefficiently for searches.

      Cloudflare is just trying to keep the bots from overwhelming everything.

      source
    • panda_abyss@lemmy.ca ⁨3⁩ ⁨weeks⁩ ago

      Cloudflare runs as a CDN/cache/gateway service in front of a ton of websites. Their service is to help protect against DDOS and malicious traffic.

      A few weeks ago cloudflare announced they were going to block AI crawling (good, in my opinion). However they also added a paid service that these AI crawlers can use, so it actually becomes a revenue source for them.

      This is a response to that from Perplexity who run an AI search company. I don’t actually know how their service works, but they were specifically called out in the announcement and Cloudflare accused them of “stealth scraping” and ignoring robots.txt and other things.

      source
      • very_well_lost@lemmy.world ⁨3⁩ ⁨weeks⁩ ago

        A few weeks ago cloudflare announced they were going to block AI crawling (good, in my opinion). However they also added a paid service that these AI crawlers can use, so it actually becomes a revenue source for them.

        I think it’s also worth pointing out that all of the big AI companies are burning through cash at an absolutely astonishing rate, and none of them are anywhere close to being profitable. So pay-walling the data they use is probably gonna be painful for their already-tortured bottom line (good).

        source
      • _cryptagion@lemmy.dbzer0.com ⁨3⁩ ⁨weeks⁩ ago

        It should be pointed out that Cloudflare didn’t say they were going to block AI traffic, they give you the option to. The service is a free opt-in for people who want it.

        source
      • nutsack@lemmy.dbzer0.com ⁨3⁩ ⁨weeks⁩ ago

        they don’t outright block ai crawlers. but they added some new tools and options for managing or blocking ai bot traffic which the cloudflare customer can choose to use or to not use.

        im running a free educational resource and i let the crawlers hit my site all they want because its useful knowledge and it’s served to them from cloudflare’s free tier cache.

        source
      • RogueBanana@piefed.zip ⁨3⁩ ⁨weeks⁩ ago

        But the website owner can still choose to continue blocking them right? Without using additional stuff like Anubis that is.

        source
    • BetaDoggo_@lemmy.world ⁨3⁩ ⁨weeks⁩ ago

      Perplexity (an “AI search engine” company with 500 million in funding) can’t bypass cloudflare’s anti-bot checks. For each search Perplexity scrapes the top results and summarizes them for the user. Cloudflare intentionally blocks perplexity’s scrapers because they consider them to be malicious traffic. Perplexity argues that their scraping is acceptable because it’s user initiated.

      Personally I think cloudflare is in the right here. The scraped sites get 0 revenue from Perplexity searches (unless the user decides to go through the sources section and click the links) and Perplexity’s scraping is unnecessarily traffic intensive since they don’t cache the scraped data.

      source
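
If the claim above is accurate and scraped pages are re-fetched on every search, a time-limited cache is the usual remedy. The following is a rough sketch only: `PageCache`, its parameters, and the injected `fetch` callable are hypothetical names invented for this illustration, not anything from Perplexity's or Cloudflare's systems.

```python
import time
from typing import Callable, Dict, Tuple


class PageCache:
    """Cache fetched page bodies per URL for a fixed time-to-live."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch      # function that actually downloads a URL
        self._ttl = ttl_seconds  # how long a cached body stays fresh
        self._clock = clock      # injectable so the expiry logic is testable
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> str:
        now = self._clock()
        cached = self._entries.get(url)
        if cached is not None and now - cached[0] < self._ttl:
            # Fresh entry: serve it without touching the origin site.
            return cached[1]
        # Missing or expired: fetch once and remember the result.
        body = self._fetch(url)
        self._entries[url] = (now, body)
        return body
```

With something like this in place, many users asking about the same page within the TTL would cost the origin site one request instead of one per search. A real deployment would presumably also honor HTTP caching headers such as `Cache-Control` and `ETag`, which this sketch leaves out.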
      • lividweasel@lemmy.world ⁨3⁩ ⁨weeks⁩ ago

        …and Perplexity’s scraping is unnecessarily traffic intensive since they don’t cache the scraped data.

        That seems almost maliciously stupid. We need to train a new model. Hey, where’d the data go? Oh well, let’s just go scrape it all again. Wait, did we already scrape this site? No idea, let’s scrape it again just to be sure.

        source
  • BaroqueInMind@piefed.social ⁨3⁩ ⁨weeks⁩ ago

    Cry more, Perplexity.

    source
  • SugarCatDestroyer@lemmy.world ⁨2⁩ ⁨weeks⁩ ago

    It seems like it’s some kind of distraction to make people think things aren’t as bad as they really are, it just sounds too far-fetched to me.

    source
    • StocktonCrushed@sh.itjust.works ⁨2⁩ ⁨weeks⁩ ago
      [deleted]
      source
      • SugarCatDestroyer@lemmy.world ⁨2⁩ ⁨weeks⁩ ago

        So that he doesn’t have to run after the rabbits, he will learn to raise them and manage them with a fake smile, providing them with a stable life lol.

        Well, I think the thing is that we still live by the law: the strong do what they want, and the weak just whine and complain.

        source
  • kreskin@lemmy.world ⁨2⁩ ⁨weeks⁩ ago

    they can’t get their AI to check a box that says “I am not a robot”? I’d think that’d be a first year comp sci student level task.

    source
    • drmoose@lemmy.world ⁨2⁩ ⁨weeks⁩ ago

      Cloudflare actually fully fingerprints your browser and even sells that data. Thats your IP, TLS, operating system, full browser environment, installed extensions, GPU capabilities etc. It’s all tracked before the box even shows up, in fact the box is there to give the runtime more time to fingerprint you.

      source
      • tempest@lemmy.ca ⁨2⁩ ⁨weeks⁩ ago

        Yeah and the worst part is it doesn’t fucking work for the one thing it’s supposed to do.

        The only thing it does is stop the stupidest low effort scrapers and forces the good ones to use a browser.

        source
    • 5gruel@lemmy.world ⁨2⁩ ⁨weeks⁩ ago

      Recaptcha v2 does way more than check if the box was checked.

      stackoverflow.com/a/27299487

      source
      • kreskin@lemmy.world ⁨2⁩ ⁨weeks⁩ ago

        you’re not wrong, but it also allows more than 99.8% of the bot traffic through too on text challenges. It’s like the TSA of website security. It’s mostly there to keep the user busy while cloudflare places itself as a man in the middle of your encrypted connection to a third party. The only difference between cloudflare and a malicious attacker is cloudflare’s stated intention not to be evil. With that and 3 dollars I can buy myself a single hard shell taco from tacobell.

        source
  • FauxLiving@lemmy.world ⁨3⁩ ⁨weeks⁩ ago

    The amount of people just reacting to the headline in the comments on these kinds of articles is always surprising.

    Your browser acts as an agent too, you don’t manually visit every script link, image source and CSS file. Everyone has experienced how annoying it is to have your browser be targeted by Cloudflare.

    There’s a pretty major difference between a human user loading a page and having it summarized and a bot that is scraping 1500 pages/second.

    Cheering for Cloudflare to be the arbiter of what technologies are allowed is incredibly short sighted. They exist to provide their clients with services, including bot mitigation. But a user initiated operation isn’t the same as a bot.

    Which is the point of the article and the article’s title.

    It isn’t clear why OP had to alter the headline to bait the anti-ai crowd.

    source
    • spankmonkey@lemmy.world ⁨3⁩ ⁨weeks⁩ ago

      But a user initiated operation isn’t the same as a bot.

      Oh fuck off with that AI company propaganda.

      The AI companies already overwhelmed sites to get training data and are repeating their shitty scraping practices when users interact with their AI. It’s the same fucking thing.

      Web crawlers for search engines don’t scrape pages every time a user searches like AI does. Both web crawlers and scrapers are bots, and how a human initiates their operation, scheduled or not, doesn’t matter as much as the fact that they do things very differently and only one of the two respects robots.txt.

      source
      • FauxLiving@lemmy.world ⁨3⁩ ⁨weeks⁩ ago

        There’s no difference in server load between a user looking at a page and a user using an AI tool to summarize the page.

        The AI companies already overwhelmed sites to get training data and are repeating their shitty scraping practices when users interact with their AI. It’s the same fucking thing.

        You either didn’t read the article or are deliberately making bad faith arguments. The entire point of the article is that the traffic that they’re referring to is initiated by a user, just like when you type an address into your browser’s address bar.

        This traffic, initiated by a user, creates the same server load as that same user loading the page in a browser.

        Yes, mass scraping of web pages creates a bunch of server load. This was the case before AI was even a thing.

        This situation is like Cloudflare presenting a captcha in order to load each individual image, CSS or JavaScript asset into a web browser because bot traffic pretends to be a browser.

        I don’t think it’s too hard to understand that a bot pretending to be a browser and a human operated browser are two completely different things and classifying them as the same (and captchaing them) would be a classification error.

        This is exactly the same kind of error. Even if you personally believe that users using AI tools should be blocked, not everyone has the same opinion. If Cloudflare can’t distinguish between bot requests and human requests then their customers can’t opt out and allow their users to use AI tools even if they want to.

        source
    • OmgItBurns@discuss.online ⁨2⁩ ⁨weeks⁩ ago

      I think part of the issue is that it does act more like a search engine crawler than a traditional user. A lot of sites rely on real human traffic for revenue (serving ads, requests to sign up for Patreon, using affiliate links, etc) that gets bypassed by these bots. Hell in some cases the people running the sites are just looking for interaction. So while there is a spike in traffic, and potentially cost, the people running these sites aren’t getting the benefit of that traffic.

      Basically these have the same issues as the summaries that Google does in their search results but, potentially, have a much larger impact on the host’s bandwidth.

      source
    • _cryptagion@lemmy.dbzer0.com ⁨3⁩ ⁨weeks⁩ ago

      Cheering for Cloudflare to be the arbiter of what technologies are allowed is incredibly short sighted. They exist to provide their clients with services, including bot mitigation.

      Well I suppose it’s a good thing then that the anti-AI shield is opt-in, and Cloudflare isn’t making any decisions for anyone on whether or not AI scrapers get to visit their pages. That little bit of context makes your entire argument fall apart.

      source
      • FauxLiving@lemmy.world ⁨3⁩ ⁨weeks⁩ ago

        It isn’t opt in.

        You can block all bot page scraping, and also block user initiated AI tools or you can block no traffic.

        There isn’t an option to block bot page scraping but allow user initiated AI tools.

        Because, as the article points out, Cloudflare is not able to distinguish between the two

        source
    • unpossum@sh.itjust.works ⁨2⁩ ⁨weeks⁩ ago

      Thank you for trying to fight the irrational anti-AI brainrot on lemmy! It’s probably a lost cause, but your efforts are appreciated :)

      source
      • FauxLiving@lemmy.world ⁨2⁩ ⁨weeks⁩ ago

        It’s an uphill battle. Lots of motivated reasoning and bad faith arguments

        source
    • ubergeek@lemmy.today ⁨2⁩ ⁨weeks⁩ ago

      Cheering for Cloudflare to be the arbiter of what technologies are allowed is incredibly short sighted

      Except, they don’t. It’s a toggle, available to users, and by default, allows Perplexity’s scraping.

      source
    • HarkMahlberg@kbin.earth ⁨3⁩ ⁨weeks⁩ ago

      In a better timeline, we wouldn't need to cheer the victory of one megacorporation over another, they would both be the losers. But also people are still capable of holding two thoughts simultaneously.

      For instance, we'd all be happy to see Apple lose the Epic Games lawsuit and be forced out of their monopoly on app stores on iOS. But those same people are aware it would allow Epic to continue being a disgusting company.

      bait the anti-ai crowd

      Oh I see lol

      source
      • FauxLiving@lemmy.world ⁨3⁩ ⁨weeks⁩ ago

        What does any of that have to do with the fact that Cloudflare isn’t able to classify traffic in order to distinguish between human user generated traffic and mass scraping bot traffic?

        If they’re incapable of distinguishing the two, then their customers are having legitimate user requests blocked by Cloudflare with no ability to opt out.

        Oh I see lol

        Yeah, I think people who’re unable to think rationally about a problem because they made up their mind before knowing any of the details are intellectually lazy.

        source
  • LodeMike@lemmy.today ⁨2⁩ ⁨weeks⁩ ago

    Words cannot describe how much I hate this person

    source
  • panda_abyss@lemmy.ca ⁨3⁩ ⁨weeks⁩ ago

    I actually agree with them

    This feels like cloudflare trying to collect rent from both sides instead of doing what’s best for the website owners.

    There is a problem with AI crawlers, but these technologies are essentially doing a search, fetching several pages, scanning/summarizing them, then presenting the findings to the user.

    I don’t really think that’s wrong, it’s just a faster version of rummaging through the SEO shit you do when you Google something.

    source