Comment on Based on this graph, and this graph alone, guess at what time I completely blocked OpenAI crawlers

hoppolito@mander.xyz 1 week ago

The code forge is Gitea/Forgejo, and the proxy in front used to be Traefik. I also tried fail2ban in front for a while, but the issue was that every request appeared to come from a different IP, so per-IP bans never kicked in.

The bots were also hitting my other public services pretty hard, but nowhere near as badly. I think it’s a combination of 2 things:

One small, interesting observation: they seemed to ‘focus’ on specific projects. My guess is that you get unlucky once when a large-ish repo gets targeted for crawling, and then the crawlers get stuck in there, lost in the maze of possible pages. On the other hand, that may make targeted blocking of certain routes more feasible…
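
A rough sketch of what route-based blocking could look like, in Python. The route patterns and the `should_block` helper are assumptions for illustration, not taken from the setup described above; the idea is only that anonymous requests to deep, per-commit views of a repo get refused while the normal project pages stay reachable.

```python
import re

# Hypothetical list of "maze" routes: per-commit and per-file history views
# that multiply into a huge number of URLs for a large repo.
# Assumed layout: /<owner>/<repo>/<view>/...
EXPENSIVE_ROUTES = re.compile(
    r"^/[^/]+/[^/]+/(?:blame|compare|commits?|src/commit|raw/commit|archive)(?:/|$)"
)


def should_block(path: str, logged_in: bool = False) -> bool:
    """Return True when an anonymous request targets an assumed expensive route."""
    if logged_in:
        # Never block signed-in users; they are unlikely to be crawlers.
        return False
    # Drop the query string so only the route itself is matched.
    route = path.split("?", 1)[0]
    return EXPENSIVE_ROUTES.match(route) is not None
```

With these assumed patterns, `should_block("/alice/bigrepo/blame/branch/main/README.md")` is true, while the repo landing page `/alice/bigrepo` and regular branch views under `/alice/bigrepo/src/branch/...` stay open. In practice the same matching would more likely live in the reverse proxy config than in application code.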

I think there’s a lot to be gained here by everybody pooling their knowledge, but it’s also an annoying topic and most selfhosting (including mine) is afaik done as a hobby, so most peeps will slap an Anubis-like proof-of-work (PoW) challenge in front and call it a day.
