Comment on What steps can be taken to prevent AI training and scraping of my public facing website?

Dekkia@this.doesnotcut.it 5 months ago

There’s a tool for that: anubis.techaro.lol

Alternatively, Cloudflare also offers scraper protection.

fuckwit_mcbumcrumble@lemmy.dbzer0.com 5 months ago
How well does Anubis actually work, though? I have no issues getting past it using Puppeteer. But I'm also just dicking around at home, not crawling an entire website.

Cloudflare for sure doesn't work very well at blocking Puppeteer or anything else that runs a full browser. It'll stop things that only rip the raw web page, but if you're running JS and even halfway trying, it's not an issue to get past. And let's be real: do you want a crawler ripping 300 KB of text, or 400 MB of page + images + videos + whatever other unnecessary garbage is on modern web pages?

- Dekkia@this.doesnotcut.it 5 months ago
  The idea behind Anubis is that a browser needs to deliver proof-of-work before accessing a website.
  
  If you're doing it one-off with Puppeteer, your "browser" will happily do just that.
  
  But if you're scraping millions of websites, short challenges like this add up quickly and you'll end up wasting lots of compute on them. Anubis works as long as scrapers decide that those websites are not worth the cost.
  
  - snoons@lemmy.ca 5 months ago
    The only stable Invidious instance I know of is also a heck of a lot more stable now thanks to it.
    
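
For anyone wondering what the proof-of-work idea from this thread looks like in practice, here is a minimal hashcash-style sketch in Python. It is not Anubis's actual implementation: the function names (`solve`, `verify`, `expected_hashes`), the challenge string, and the difficulty values are all assumptions made up for illustration. The client has to search for a nonce, the server only has to check one hash, and the cost multiplies with the number of sites a scraper visits.

```python
import hashlib
import itertools


def _digest(challenge: str, nonce: int) -> str:
    """SHA-256 hex digest of the challenge string followed by the nonce."""
    return hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force the first nonce whose digest starts with
    `difficulty` zero hex digits. This is the expensive part."""
    prefix = "0" * difficulty
    for nonce in itertools.count():
        if _digest(challenge, nonce).startswith(prefix):
            return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check the client's answer."""
    return _digest(challenge, nonce).startswith("0" * difficulty)


def expected_hashes(difficulty: int, sites: int) -> int:
    """Average number of hashes needed to pass one challenge on each of
    `sites` websites (each hex digit of difficulty multiplies the work by 16)."""
    return (16 ** difficulty) * sites


if __name__ == "__main__":
    # Hypothetical challenge value and difficulty, chosen so the demo runs quickly.
    nonce = solve("example-challenge", 3)
    print("accepted:", verify("example-challenge", nonce, 3))
    print("hashes for 1 site at difficulty 4:", expected_hashes(4, 1))
    print("hashes for 1,000,000 sites at difficulty 4:", expected_hashes(4, 1_000_000))
```

A single visitor barely notices a few thousand hashes, which is why a one-off Puppeteer run gets through. A crawler hitting a million protected sites pays that price a million times over, which is the "adds up quickly" part.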