Comment on Anubis is awesome! Stopping (AI)crawlbots
blob42@lemmy.ml 5 days ago
I am planning to use it. For Caddy users: after being bombarded by AI crawlers for weeks, I came up with a solution some time ago that works.
It is a custom caddy CEL expression filter coupled with caddy-ratelimit and caddy-defender.
Now here’s the fun part: the defender plugin can respond with garbage, so when a request matches an AI crawler, the response poisons their training dataset.
Originally I relied only on the rate limiter and noticed the AI bots kept trying whenever the rate limit was reset. Once I introduced data poisoning they all stopped :)
```
git.blob42.xyz {
    @bot <<CEL
        header({'Accept-Language': 'zh-CN'}) || header_regexp('User-Agent', '(?i:(.bot.|.crawler.|.meta.|.google.|.microsoft.|.spider.))')
    CEL

    abort @bot

    defender garbage {
        ranges aws azurepubliccloud deepseek gcloud githubcopilot openai 47.0.0.0/8
    }

    rate_limit {
        zone dynamic_botstop {
            match {
                method GET
                # to use with defender
                #header X-RateLimit-Apply true
                #not header LetMeThrough 1
            }
            key {remote_ip}
            events 1500
            window 30s
            #events 10
            #window 1m
        }
    }

    reverse_proxy upstream.server:4242

    handle_errors 429 {
        respond "429: Rate limit exceeded."
    }
}
```
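To get a feel for what the `@bot` matcher catches, here is a rough Python approximation of the same logic. It is only a sketch: the `is_bot` helper is made up for illustration, the regex is copied as it appears in the post, and it assumes Caddy's `header` matcher compares the value exactly (so `zh-CN,zh;q=0.9` would not match on its own).

```python
import re

# Regex copied as it appears in the post; Python's re accepts the same (?i:...) syntax.
BOT_UA = re.compile(r"(?i:(.bot.|.crawler.|.meta.|.google.|.microsoft.|.spider.))")


def is_bot(headers: dict[str, str]) -> bool:
    """Hypothetical stand-in for the @bot matcher (not Caddy's actual code).

    Assumption: the Accept-Language check is an exact string comparison.
    Header names are looked up case-sensitively here to keep the sketch short.
    """
    if headers.get("Accept-Language") == "zh-CN":
        return True
    # search() is unanchored, so a hit anywhere in the User-Agent counts.
    return BOT_UA.search(headers.get("User-Agent", "")) is not None


if __name__ == "__main__":
    samples = [
        {"User-Agent": "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"},
        {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
         "Accept-Language": "en-US,en;q=0.5"},
    ]
    for s in samples:
        print(is_bot(s), s.get("User-Agent"))
```

Note that the pattern is broad: anything with "google", "meta" or "microsoft" somewhere in the User-Agent is treated as a bot, including legitimate search engine crawlers.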
azertyfun@sh.itjust.works 4 days ago
That’s an ARIN block according to Wikipedia, so North America; it was under Northern Telecom until 2010. It does look like Alibaba operates many networks under that /8, but I very much doubt it’s the whole /8, which would be worth a lot; a /16 is apparently worth around $3-4M, so a /8 can be extrapolated to be worth upwards of a billion dollars! I doubt they put all their eggs into that particular basket. So you’re probably matching a lot of innocent North American IPs with this.
blob42@lemmy.ml 4 days ago
Right, I must have just blanket banned the whole /8 to be sure Alibaba Cloud is included. I did it some time ago, so I forgot.
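For scale, a quick sketch of how much address space `47.0.0.0/8` covers and what the extrapolation above works out to. The $3-4M per /16 price is just the figure quoted in the comment above, not something verified here.

```python
import ipaddress

block = ipaddress.ip_network("47.0.0.0/8")

# A /16 holds 2^(32-16) addresses; count how many /16s fit in the /8.
per_16 = 2 ** (32 - 16)
n_16s = block.num_addresses // per_16

# USD per /16, taken from the comment above (unverified assumption).
low, high = 3_000_000, 4_000_000

print(block.num_addresses)         # 16777216 addresses in the /8
print(n_16s)                       # 256 blocks of size /16
print(n_16s * low, n_16s * high)   # 768000000 1024000000
```

So the estimate lands somewhere between roughly $0.77B and $1.02B under that price assumption, and the ban covers about 16.8 million addresses.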
Cozog@feddit.dk 4 days ago
When I blocked Alibaba, the AI crawlers immediately started coming from a different cloud provider (Huawei, I believe), and when I blocked that, it happened again. Eventually the crawlers started coming from North American and then European cloud providers.
Due to lack of time to change my setup to accommodate Anubis, I had to temporarily move my site behind Cloudflare (where it sadly still is).
blob42@lemmy.ml 3 days ago
We need a decentralized, community-owned Cloudflare alternative. Anubis looks to be on the right track.