Comment on Perplexity AI is complaining their plagiarism bot machine cannot bypass Cloudflare's firewall
BetaDoggo_@lemmy.world 10 hours ago
Perplexity (an “AI search engine” company with 500 million in funding) can’t bypass Cloudflare’s anti-bot checks. For each search, Perplexity scrapes the top results and summarizes them for the user. Cloudflare intentionally blocks Perplexity’s scrapers because it considers them malicious traffic. Perplexity argues that its scraping is acceptable because it’s user-initiated.
Personally, I think Cloudflare is in the right here. The scraped sites get 0 revenue from Perplexity searches (unless the user decides to go through the sources section and click the links), and Perplexity’s scraping is unnecessarily traffic-intensive since they don’t cache the scraped data.
lividweasel@lemmy.world 8 hours ago
That seems almost maliciously stupid. We need to train a new model. Hey, where’d the data go? Oh well, let’s just go scrape it all again. Wait, did we already scrape this site? No idea, let’s scrape it again just to be sure.
jballs@lemmy.world 7 hours ago
It’s worth giving the article a read. It seems that they’re not using the data for training, but for real-time results.
spankmonkey@lemmy.world 6 hours ago
They do it this way in case the data has changed, similar to how a person would view the current version of the site. The training was for the basic understanding; the real-time scraping is to account for changes.
It is also horribly inefficient and works like a small-scale DDoS attack.
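For anyone wondering what the missing caching could look like, here is a minimal sketch of a short-lived page cache. It is only an illustration, not how Perplexity or Cloudflare actually work: the class name `TTLPageCache`, the `fetch` callable, and the 300-second default are all made up for the example. The idea is that repeat searches within the time window reuse the stored copy, while anything older is fetched again so changes still get picked up.

```python
import time
from typing import Callable, Dict, Tuple


class TTLPageCache:
    """Hypothetical short-lived cache: repeat lookups within the TTL skip the origin site."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 300.0,  # assumed freshness window, not a real-world value
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        # url -> (time the page was fetched, page body)
        self._entries: Dict[str, Tuple[float, str]] = {}
        self.origin_hits = 0  # how many times the origin site was actually contacted

    def get(self, url: str) -> str:
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry[0] < self._ttl:
            # Still fresh: serve the stored copy without touching the origin.
            return entry[1]
        # Missing or stale: fetch once and remember the result.
        self.origin_hits += 1
        body = self._fetch(url)
        self._entries[url] = (now, body)
        return body
```

With something like this, a thousand users asking about the same page inside the window would cost the site one request instead of a thousand, and the summary would still be at most a few minutes out of date.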