
Comment on Cloudflare announces AI Labyrinth, which uses AI-generated content to confuse and waste the resources of AI Crawlers and bots that ignore “no crawl” directives.


dual_sport_dork@lemmy.world 1 year ago

Especially since the solution I cooked up for my site was to identify the incoming requests from these damn bots and simply IP ban them. Spotting 'em is not difficult: they ignore all directives and sanity and try to slam your site with like 200+ requests per second.

In fact, anybody who doesn’t exhibit a sane crawl rate gets blocked from my site automatically. For a while, most of them were coming from Russian IP address ranges for some reason. These days Amazon is the worst offender. I guess their Rufus AI or whatever the fuck it is tries to pester other retail sites to “learn” about products rather than sticking to its own domain.

Fuck 'em. Route those motherfuckers right to /dev/null.
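
For anyone who wants to try the same approach, here is a minimal sketch of rate-based banning as described above. It is not the poster's actual setup: the `RateBanner` class, the one-second sliding window, and the in-memory ban set are assumptions made for illustration. Only the 200 requests per second figure comes from the comment.

```python
import time
from collections import defaultdict, deque


class RateBanner:
    """Ban any client IP that sends more than max_requests within window_seconds.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """

    def __init__(self, max_requests=200, window_seconds=1.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests
        self.banned = set()

    def allow(self, ip, now=None):
        """Record one request from ip; return False if the IP is banned."""
        if ip in self.banned:
            return False
        now = time.monotonic() if now is None else now
        window = self.hits[ip]
        window.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while window and now - window[0] > self.window_seconds:
            window.popleft()
        if len(window) > self.max_requests:
            # Too many requests inside the window: ban and stop tracking.
            self.banned.add(ip)
            del self.hits[ip]
            return False
        return True
```

In a real deployment the ban would more likely be pushed to the firewall or web server than held in process memory, and the threshold would need tuning to the site's normal traffic.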



desktop_user@lemmy.blahaj.zone 1 year ago
The only problem with applying that solution to generic websites is that schools and institutions can have many legitimate users behind one IP address, and many sites don’t want to risk accidentally blocking one.

- dual_sport_dork@lemmy.world 1 year ago
  This is fair in those applications. I only run an ecommerce web site, though, so that doesn’t come into play.
  
morrowind@lemmy.ml 1 year ago
Cloudflare offers that too, but you can’t always tell

Buelldozer@lemmy.today 1 year ago

and try to slam your site with like 200+ requests per second

Your solution would do nothing to stop the crawlers that operate at 10ish rps. There are ones out there operating at a mere 2 rps, but when multiple companies are doing it at the same time, 24x7x365, it adds up.

Some incredibly talented people have been battling this since last year and your solution has been tried multiple times. It’s not effective in all instances and can require a LOT of manual intervention and SysAdmin time.

thelibre.news/foss-infrastructure-is-under-attack…

- confusedbytheBasics@lemmy.world 1 year ago
  Yep. After you ban all the easy-to-spot ones you’re still left with far too many hard-to-ID bots. At least if your site is popular and large.
  
- dual_sport_dork@lemmy.world 1 year ago
  It’s worked alright for me. Your mileage may vary.
  
  If someone is scraping my site at a low crawl rate I honestly don’t care, so long as it doesn’t impact my performance for everyone else. If I hosted anything that was not just public knowledge or copy regurgitated verbatim from the bumf provided by the vendors of the brands I sell, I might object to it ideologically. But I don’t. So I don’t.
  
  If parallel crawling from multiple organizations legitimately becomes a concern for us I will have to get more creative. But thus far it hasn’t, and honestly just wholesale blocking Amazon from our shit instantly solved 90% of the problem.
  
Flagstaff@programming.dev 1 year ago
Geez, that’s a lot of requests!

- dual_sport_dork@lemmy.world 1 year ago
  It sure is. Needless to say, I noticed it happening.
  