Comment on Perplexity AI is complaining their plagiarism bot machine cannot bypass Cloudflare's firewall

Perplexity (an “AI search engine” company with 500 million in funding) can’t bypass Cloudflare’s anti-bot checks. For each search, Perplexity scrapes the top results and summarizes them for the user. Cloudflare intentionally blocks Perplexity’s scrapers because it considers them malicious traffic. Perplexity argues that its scraping is acceptable because it’s user-initiated.

Personally, I think Cloudflare is in the right here. The scraped sites get zero revenue from Perplexity searches (unless the user decides to go through the sources section and click the links), and Perplexity’s scraping is unnecessarily traffic-intensive since they don’t cache the scraped data.
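
For what it's worth, even a short-lived cache would cut most of that repeat traffic while keeping results reasonably current. Here is a minimal sketch of the idea, not a description of how Perplexity actually works: `TTLPageCache`, `fake_fetch`, and the 300-second lifetime are all made up for illustration, and the fetcher is a stand-in so the example runs without touching the network.

```python
import time
from typing import Callable, Dict, Tuple


class TTLPageCache:
    """Reuse a fetched page for ttl_seconds instead of re-requesting it."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}
        self.origin_requests = 0  # how many times the site was actually hit

    def get(self, url: str) -> str:
        now = self._clock()
        cached = self._store.get(url)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # still fresh: no request reaches the origin
        body = self._fetch(url)
        self.origin_requests += 1
        self._store[url] = (now, body)
        return body


def fake_fetch(url: str) -> str:
    # Hypothetical stand-in for a real HTTP request.
    return f"<html>contents of {url}</html>"


cache = TTLPageCache(fake_fetch, ttl_seconds=300)
for _ in range(1000):  # 1000 user searches that all cite the same page
    cache.get("https://example.com/article")
print(cache.origin_requests)  # prints 1
```

The trade-off is staleness: a page that changes inside the TTL window is served out of date until the entry expires, so the lifetime would have to be tuned per site.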

lividweasel@lemmy.world 9 months ago

…and Perplexity’s scraping is unnecessarily traffic intensive since they don’t cache the scraped data.

That seems almost maliciously stupid. We need to train a new model. Hey, where’d the data go? Oh well, let’s just go scrape it all again. Wait, did we already scrape this site? No idea, let’s scrape it again just to be sure.

- rdri@lemmy.world 9 months ago
  First we complain that AI steals and trains on our data. Then we complain when it doesn’t train. Cool.
  - ubergeek@lemmy.today 9 months ago
    I think it boils down to “consent” and “remuneration”.
    
    I run a website that I do not consent to being accessed for LLMs. However, should LLMs use my content, I should be compensated for such use.
    
    So, these LLM startups ignore both consent and the idea of remuneration.
    
    Most of these concepts have already been figured out in law if we treat websites as akin to real estate: the typical trespass laws apply, as does compensatory usage, and hell, even eminent domain if needed, i.e., a city government can “take over” the boosted post feature to make sure alerts get pushed as widely and quickly as possible.
    
    - rdri@lemmy.world 9 months ago
      That all sounds very vague to me, and I don’t expect it to be captured properly by law any time soon. Being accessed for an LLM? What does that mean to you, and how is it different from being accessed by a user? Imagine you host a weather forecast. If that information is public, what kind of compensation do you expect from anyone or anything that accesses that data?
      
      Is it okay for a person to access your site? Is it okay for a script written by that person to fetch data every day automatically? Would it be okay for a user to dump a page of your site with a headless browser? Would it be okay to let an LLM take a look at it to extract info required by a user? Have you heard about the changedetection.io project? If some of these sound unfair to you, you might want to put DRM on your data or something.
      
      Would you expect compensation from me after reading your comment?
- spankmonkey@lemmy.world 9 months ago
  They do it this way in case the data changed, similar to how a person would be viewing the current site. The training was for the basic understanding; the real-time scraping is to account for changes.
  
  It is also horribly inefficient and works like a small-scale DDoS attack.
- jballs@lemmy.world 9 months ago
  It’s worth giving the article a read. It seems that they’re not using the data for training, but for real-time results.