Comment on Google Is the Only Search Engine That Works on Reddit Now Thanks to AI Deal
reddig33@lemmy.world 1 month ago
I’m not understanding what stops a search engine from scraping a publicly accessible website?
Eril@feddit.org 1 month ago
robots.txt, I guess? Yes, you can just ignore it, but you shouldn’t, if you develop a responsible web scraper.
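To make the "you can just ignore it" point concrete, here is a minimal sketch of how a well-behaved scraper might check robots.txt before fetching. It uses Python's standard `urllib.robotparser`. The robots.txt body, the bot names (`FriendlyBot`, `OtherBot`), and the URL are made up for illustration and are not taken from any real site. Note that the check runs entirely on the scraper's side, so nothing technically forces a client to perform it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body (an assumption for illustration only):
# one named crawler is allowed everywhere, everyone else is disallowed.
EXAMPLE_ROBOTS_TXT = """\
User-agent: FriendlyBot
Disallow:

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/r/test"
    for bot in ("FriendlyBot", "OtherBot"):
        print(bot, is_allowed(EXAMPLE_ROBOTS_TXT, bot, url))
```

The file format itself supports per-crawler groups like the one above, so a site can express different rules for different user agents; whether a crawler honors them is up to the crawler.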
hotpot8toe@lemmy.world 1 month ago
Also, rate limiting. A publicly accessible website doesn’t mean that it will allow scrapers to read millions of pages each week. They can easily identify and block scrapers because of the pattern of their activity. I don’t know if Reddit has rate limiting, but I wouldn’t be surprised if they implemented it.
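For anyone wondering what rate limiting looks like in practice, here is a rough sketch of a per-client sliding-window limiter. This is only an illustration of the general idea, not how Reddit or any specific site does it. The class name, the `client_id` key (which could be an IP address or API token), and the limits are all assumptions.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds (hypothetical sketch)."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps each client to the timestamps of its recently accepted requests.
        self._hits = defaultdict(deque)

    def allow(self, client_id: str, now: float) -> bool:
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # Over the limit: the server would reject or throttle.
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=2, window_seconds=10.0)
    for t in (0.0, 1.0, 2.0, 10.0):
        print(t, limiter.allow("scraper-1", now=t))
```

A real deployment would typically combine something like this with other signals (request patterns, user agent, and so on), but the core bookkeeping is about this simple.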
reddig33@lemmy.world 1 month ago
Doesn’t seem legal that a robots.txt could pick and choose who scrapes. Seems like legally it would have to be all or nothing. Here’s hoping one of the search engines ignores it and makes it a legal case.
Eril@feddit.org 1 month ago
Actually currently it contains this:
Well, that actually is a blanket ban for everyone, so something else must be at play here.
starman@programming.dev 1 month ago
merj.com/…/investigating-reddits-robots-txt-cloak…
Reddit is serving a different file to Google.
capital@lemmy.world 1 month ago
You’d probably feel differently if it were your service. Should you be able to control who scrapes your sites or should that be all or nothing?
For the record, I fucking hate what the internet is becoming. I naively believed that even if shit got cordoned off into the walled gardens that are mobile phone apps, the web would remain as open as it was. This is a terrible sign of things to come.
reddig33@lemmy.world 1 month ago
No, I wouldn’t feel differently. In fact letting search engines scrape and point to your content is what leads people to your site. It’s free advertising. If you’re going to let one search engine in, you should let them all in.