Pretty sure I’ve repeatedly heard about the crawlers completely ignoring robots.txt, so does Cloudflare really do that much?
Comment on Based on this graph, and this graph alone, guess at what time I completely blocked OpenAI crawlers
punrca@piefed.world 1 month ago
It’s best to use either Cloudflare (best IMO) or Anubis.
If you don’t want any AI bots, then you can set up Anubis (open source; requires JavaScript to be enabled by the end user): https://github.com/TecharoHQ/anubis
Cloudflare automatically sets up a robots.txt file to block “AI crawlers” (but you can configure it to allow “AI search” for better SEO). E.g.: https://blog.cloudflare.com/control-content-use-for-ai-training/#putting-up-a-guardrail-with-cloudflares-managed-robots-txt
Cloudflare also has an “AI labyrinth” option that serves a maze of fake data to AI bots that don’t respect the robots.txt file.
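For anyone curious what a robots.txt block looks like from the crawler's side, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules below are a hand-written example, not the file Cloudflare actually generates, and `is_allowed` and the example.com URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block two AI-training crawlers, allow everyone else.
# (Illustrative only; not the managed file Cloudflare produces.)
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "CCBot", "Mozilla/5.0"):
        print(agent, is_allowed(agent, "https://example.com/post/1"))
```

Note that this only models a crawler that bothers to ask. Nothing in robots.txt enforces anything, which is why the server-side options (Anubis, the labyrinth) exist.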
AHemlocksLie@lemmy.zip 1 month ago
Sv443@sh.itjust.works 1 month ago
Like a lock on a door, it stops the vast majority but can’t do shit about the actual professional bad guys
FreedomAdvocate@lemmy.net.au 1 month ago
Cloudflare definitely can and does stop the vast majority of actual professional bad guys.
tomjuggler@lemmy.world 1 month ago
Yes, Cloudflare blocks agents completely if they ignore its restrictions. The key is scale: Cloudflare has a bird’s-eye view of traffic patterns across millions of sites and can do statistical analysis to determine who is a bot.
I hate the necessity but it works
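A toy illustration of why the cross-site view matters (this is not how Cloudflare actually detects bots; the function name, the log format, and the threshold are all assumptions made up for the example): a client that looks unremarkable on any single site stands out once you can see how many different sites it hits.

```python
from collections import defaultdict


def flag_suspected_bots(requests, min_sites=3):
    """Flag clients seen on at least `min_sites` distinct sites.

    `requests` is an iterable of (client_id, site) pairs, a hypothetical
    aggregated log that only a network-wide observer would have.
    """
    sites_by_client = defaultdict(set)
    for client, site in requests:
        sites_by_client[client].add(site)
    return {c for c, sites in sites_by_client.items() if len(sites) >= min_sites}


if __name__ == "__main__":
    log = [
        ("client-a", "blog.example"),
        ("client-a", "shop.example"),
        ("client-a", "wiki.example"),
        ("client-b", "blog.example"),
        ("client-b", "blog.example"),
    ]
    print(flag_suspected_bots(log))
```

A single self-hoster only ever sees one row of that picture, so the same client would look like an ordinary visitor.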
shane@feddit.nl 1 month ago
If you’re relying on Cloudflare are you even self-hosting?
CyberSeeker@discuss.tchncs.de 1 month ago
If you build a house, but hire a guard for the front gate, do you even own the house?!
Impassionata@lemmy.world 1 month ago
If you use DNS at all, do you even own your street address!?!?
sudoer777@lemmy.ml 1 month ago
Yes, if it’s tunneled to your self-hosting setup. With CGNAT you have to use similar services if you want to self-host.