cross-posted from: discuss.online/post/32165111
I realize my options are limited, but what about any robots.txt style steps? Thanks for any suggestions.
Submitted 1 day ago by kiol@discuss.online to technology@lemmy.world
cross-posted from: discuss.online/post/32165111
I realize my options are limited, but what about any robots.txt style steps? Thanks for any suggestions.
We learned this lesson in the 90s: If you put something on the (public) Internet, assume it will be scraped (and copied and used in various ways without your consent). If you don’t want that, don’t put it on the Internet.
There’s all sorts of clever things you can do to prevent scraping but none of them are 100% effective and all have negative tradeoffs.
For reference, the big AI players aren’t scraping the Internet to train their LLMs anymore. That creates too many problems, not the least of which is making yourself vulnerable to poisoning. If an AI is scraping your content at this point it’s either amateurs or they’re just indexing it like Google would (or both) so the AI knows where to find it without having to rely on 3rd parties like Google.
Remember: Scraping the Internet is everyone’s right. Trying to stop it is futile and only benefits the biggest of the big search engines/companies.
As someone with a public facing website, there are significant volumes of scraping still happening. But largely this appears to come out of South East Asia and South America and they take steps to hide who they are so it’s not clear who is doing it or why, but like you say it doesn’t appear to be OpenAI, Google, etc.
It doesn’t appear to be web search indexing, the scraping is aggressive and the volume will bring down a Lemmy server no matter how powerful the hardware.
There’s a tool for that: anubis.techaro.lol
Alternatively cloudflare also has scraper-protection offerings.
How well does Anubis actually work though? I have no issues with getting past it using puppeteer. But I’m also just dicking around at home not crawling an entire website.
Cloudflare for sure doesn’t work very well at blocking puppeteer or anything that runs a full browser. It’ll stop things that only rip the raw web page, but if you’re running JS and even halfway trying it’s not an issue to get past. And let’s be real. Do you want a crawler ripping 300k of text, or 400MB of page + images + videos + whatever other unnecessary garbage are on modern web pages?
The idea behind anubis is that a browser needs to deliver proof-of-work before accessing a website.
If you’re doing it one-off with puppeteer, your “browser” will happily do just that.
But if you’re scraping millions of websites, short challenges like this add up quickly and you’ll end up wasting lots of compute on them. As long as scrapers decide that those websites are not worth it anubis works.
Best practice right now is Anubis, and if you want to do a little bit extra and fight against robots.txt violating bots you could set up a infinite web of garbage with links to more garbage in a hidden part of your website. Be aware that it will cost you bandwith keeping them busy.
Nothing. You could sue for damages maybe, but you have to have been damaged by it somehow.
Hope this can help:
Not much you can do, if it’s on the internet it is public.
You can block some scrapers with PoW and that sort of thing, but you’ll never block all of them.
Not arguing against doing this as much as possible but I also recommend assuming your website will be scraped by bots and taking advantage of that to poison all the AI models you can. Feed in nonsense to the robots that isn’t public facing to humans on your website, have 5% of your content be blatant nonsense that asserts obviously untrue statements confidently but in a way that doesn’t disguise the clear intent of purposeful absurdity to human viewers.
See it as an opportunity not a vulnerability. Text is cheap, it doesn’t even really take up storage space on your website so why not?
I had a website that was set up for only my personal use. According to the logs the only activity I ever saw was my own. However, it involves a compromise. Obscurity at the cost of accessibility and convenience.
First, when I set up my SSL cert, I chose to get a wildcard subdomain cert. That way I could use a random subdomain name and it wouldn’t show up on crt.sh
Second, I use an uncommon port. My needs are very low so I don’t need to access my site all the time. The site is just a fun little hobby for myself. That means I’m not worried about accessing my site through places/businesses that block uncommon ports.
Accessing my site through a browser looks like: https//randomsubdomain.domainname.com:4444/
I’m going on the assumption that scrapers and crawlers are going to be searching common ports to maximize the number of sites they can access over wasting their time on searching uncommon ports.
If you are hosting on common ports (80, 443) then this isn’t going to be helpful at all and would likely require some sort of third party to manage scrapers and crawlers. For me, I get to enjoy my tiny corner of the internet with minimal effort and worry. Except my hard drive died recently so I’ll pick up again in January when I am not focused on other projects.
I’m sure given time, something will find my site. The game I’m playing is seeing how long it would take to find me.
Encrypted text is pretty much worthless to LLM. The hard part is getting decrypters to potential readers. RSS could get the text to readers; it could even get decrypters to readers as well … if someone was working on this problem.
kethali@lemmy.ca 1 day ago
Pretty sure they only way is to not have any public facing website… But obviously that’s not what you’re looking for :-(