Comment on What steps can be taken to prevent AI training and scraping of my public facing website?
Dekkia@this.doesnotcut.it 3 days ago
There’s a tool for that: anubis.techaro.lol
Alternatively cloudflare also has scraper-protection offerings.
fuckwit_mcbumcrumble@lemmy.dbzer0.com 3 days ago
How well does Anubis actually work though? I have no issues with getting past it using puppeteer. But I’m also just dicking around at home not crawling an entire website.
Cloudflare for sure doesn’t work very well at blocking puppeteer or anything that runs a full browser. It’ll stop things that only rip the raw web page, but if you’re running JS and even halfway trying, it’s not an issue to get past. And let’s be real: do you want a crawler ripping 300k of text, or 400MB of page + images + videos + whatever other unnecessary garbage is on modern web pages?
Dekkia@this.doesnotcut.it 3 days ago
The idea behind anubis is that a browser needs to deliver proof-of-work before accessing a website.
If you’re doing it one-off with puppeteer, your “browser” will happily do just that.
But if you’re scraping millions of websites, short challenges like this add up quickly and you’ll end up wasting lots of compute on them. As long as scrapers decide that those websites are not worth it, anubis works.
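To make the "adds up quickly" part concrete, here is a rough sketch of a hash-based proof-of-work challenge. This is not Anubis's actual code; the challenge string, the "leading zero hex digits" difficulty rule, and the function names `solve`, `verify` and `expected_hashes` are all assumptions made up for illustration.

```python
import hashlib
import itertools


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce whose SHA-256 hex digest
    starts with `difficulty` zero characters (assumed rule)."""
    prefix = "0" * difficulty
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check the answer."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


def expected_hashes(difficulty: int, sites: int) -> int:
    """Average number of hashes a scraper pays across `sites` challenges:
    each hex digit has 16 possible values, so about 16**difficulty per site."""
    return (16 ** difficulty) * sites


if __name__ == "__main__":
    # Hypothetical challenge string; a real server would issue a random one.
    nonce = solve("example-challenge", 3)
    print("nonce:", nonce, "valid:", verify("example-challenge", nonce, 3))
    print("hashes for 1M sites at difficulty 4:", expected_hashes(4, 1_000_000))
```

The asymmetry is the point: the server checks with one hash, a single visitor pays a few thousand, and a crawler hitting a million protected sites pays that a million times over.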
snoons@lemmy.ca 3 days ago
The only stable invidious instance I know of is now a heck of a lot more stable thanks to it, too.