Comment on Anubis is awesome! Stopping (AI)crawlbots

blob42@lemmy.ml ⁨5⁩ ⁨days⁩ ago

I am planning to use it. For Caddy users, I came up with a solution some time ago that works, after being bombarded by AI crawlers for weeks.

It is a custom Caddy CEL expression filter coupled with caddy-ratelimit and caddy-defender.

Now here’s the fun part: the defender plugin can produce garbage as a response, so when an AI crawler matches, the garbage it receives will poison its training dataset.

Originally I relied only on the rate limiter and noticed the AI bots kept retrying whenever the rate limit was reset. Once I introduced data poisoning, they all stopped :)

```
git.blob42.xyz {
    @bot <<CEL
        header({'Accept-Language': 'zh-CN'})
        || header_regexp('User-Agent', '(?i:(.*bot.*|.*crawler.*|.*meta.*|.*google.*|.*microsoft.*|.*spider.*))')
    CEL

    abort @bot

    defender garbage {
        ranges aws azurepubliccloud deepseek gcloud githubcopilot openai 47.0.0.0/8
    }

    rate_limit {
        zone dynamic_botstop {
            match {
                method GET
                # to use with defender
                #header X-RateLimit-Apply true
                #not header LetMeThrough 1
            }
            key {remote_ip}
            events 1500
            window 30s
            #events 10
            #window 1m
        }
    }

    reverse_proxy upstream.server:4242

    handle_errors 429 {
        respond "429: Rate limit exceeded."
    }
}
```
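For reference, the `header_regexp` pattern in the matcher behaves roughly like this (a Python sketch of the same regex; the sample User-Agent strings are made up for illustration, not taken from real crawler traffic):

```python
import re

# Same case-insensitive substring pattern used in the CEL matcher above.
UA_PATTERN = re.compile(
    r'(?i:(.*bot.*|.*crawler.*|.*meta.*|.*google.*|.*microsoft.*|.*spider.*))'
)

def is_suspected_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent contains any of the flagged substrings."""
    return UA_PATTERN.search(user_agent) is not None

print(is_suspected_crawler("Mozilla/5.0 (compatible; GPTBot/1.0)"))          # True
print(is_suspected_crawler("Mozilla/5.0 (X11; Linux) Gecko Firefox/126.0"))  # False
```

Note that a pattern this broad will also catch legitimate bots (and anything that merely mentions "meta" or "google"), which is exactly why the config only aborts matching requests rather than blocking IPs outright.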
