Comment on "What steps can be taken to prevent AI training and scraping of my public-facing website?"
riskable@programming.dev 1 day ago
We learned this lesson in the 90s: If you put something on the (public) Internet, assume it will be scraped (and copied and used in various ways without your consent). If you don’t want that, don’t put it on the Internet.
There are all sorts of clever things you can do to deter scraping, but none of them are 100% effective and all of them have negative tradeoffs.
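The crudest of those tricks is refusing requests from crawlers that announce themselves in their User-Agent header. Here's a rough sketch of the idea in Python/Flask (the bot names are just a few publicly documented AI crawler tokens; anything that spoofs its User-Agent walks straight past it):

```python
# Sketch: refuse requests from crawlers that identify themselves as AI bots.
# Only stops polite crawlers that send an honest User-Agent.
from flask import Flask, request, abort

app = Flask(__name__)

# Example tokens for a few publicly documented AI crawlers; adjust to taste.
AI_BOT_SIGNATURES = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")

@app.before_request
def block_ai_crawlers():
    ua = request.headers.get("User-Agent", "")
    if any(sig.lower() in ua.lower() for sig in AI_BOT_SIGNATURES):
        abort(403)  # Forbidden

@app.route("/")
def index():
    return "Hello, humans."
```

That's the tradeoff in a nutshell: you only block the crawlers polite enough to identify themselves.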
For reference, the big AI players aren’t scraping the Internet to train their LLMs anymore. That creates too many problems, not the least of which is leaving themselves vulnerable to data poisoning. If something is scraping your content at this point, it’s either amateurs, or they’re just indexing it like Google would (or both), so the AI knows where to find it without having to rely on third parties like Google.
Remember: Scraping the Internet is everyone’s right. Trying to stop it is futile and only benefits the biggest of the big search engines/companies.
Dave@lemmy.nz 1 day ago
As someone with a public-facing website, I can say there are still significant volumes of scraping happening. Most of it appears to come out of South East Asia and South America, and the operators take steps to hide who they are, so it’s not clear who is doing it or why. But like you say, it doesn’t appear to be OpenAI, Google, etc.
It doesn’t appear to be web search indexing either; the scraping is aggressive, and the volume will bring down a Lemmy server no matter how powerful the hardware.
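If you’re fighting the same thing, per-IP rate limiting in front of the application is the usual blunt instrument. It’s normally done at the reverse proxy (e.g. nginx’s limit_req) rather than in the app itself, but here’s a rough token-bucket sketch as Python/WSGI middleware just to illustrate the idea (the rate and burst numbers are made up):

```python
# Rough per-IP token-bucket rate limiter as WSGI middleware.
# Real deployments would also evict stale buckets and sit behind a proxy.
import time
from collections import defaultdict

class RateLimitMiddleware:
    def __init__(self, app, rate=5.0, burst=20):
        self.app = app
        self.rate = rate      # tokens refilled per second
        self.burst = burst    # maximum bucket size
        self.buckets = defaultdict(
            lambda: {"tokens": burst, "ts": time.monotonic()}
        )

    def __call__(self, environ, start_response):
        ip = environ.get("REMOTE_ADDR", "unknown")
        bucket = self.buckets[ip]
        now = time.monotonic()
        # Refill tokens based on time elapsed since the last request.
        bucket["tokens"] = min(
            self.burst, bucket["tokens"] + (now - bucket["ts"]) * self.rate
        )
        bucket["ts"] = now
        if bucket["tokens"] < 1:
            start_response("429 Too Many Requests",
                           [("Content-Type", "text/plain")])
            return [b"Slow down.\n"]
        bucket["tokens"] -= 1
        return self.app(environ, start_response)

# Example (Flask): app.wsgi_app = RateLimitMiddleware(app.wsgi_app)
```

It only helps when the traffic is concentrated on a handful of IPs, though; scrapers that rotate through large address pools will still get through.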