Doesn’t seem legal that a robots.txt could pick and choose who scrapes. Seems like legally it would have to be all or nothing. Here’s hoping one of the search engines ignores it and makes it a legal case.
Comment on Google Is the Only Search Engine That Works on Reddit Now Thanks to AI Deal
Eril@feddit.org 5 months ago
robots.txt, I guess? Yes, you can just ignore it, but you shouldn't if you develop a responsible web scraper.
reddig33@lemmy.world 5 months ago
Eril@feddit.org 5 months ago
Actually, it currently contains this:
User-agent: *
Disallow: /
Well, that actually is a blanket ban for everyone, so something else must be at play here.
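For anyone curious what "responsible" looks like in practice: Python's standard library can parse exactly the robots.txt quoted above and tell you what it permits. This is a minimal offline sketch (a real scraper would fetch the live file with `set_url()`/`read()` instead of `parse()`); the bot name is made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Parse the robots.txt quoted above, offline. A real scraper would do:
#   rp.set_url("https://www.reddit.com/robots.txt"); rp.read()
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])

# A blanket "Disallow: /" under "User-agent: *" denies every path
# to every crawler that honors the file.
print(rp.can_fetch("MyScraperBot", "https://www.reddit.com/r/programming/"))  # False
print(rp.can_fetch("MyScraperBot", "https://www.reddit.com/"))                # False
```

Which is the point: the file only works if the scraper voluntarily runs a check like this before each request.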
starman@programming.dev 5 months ago
merj.com/…/investigating-reddits-robots-txt-cloak…
Reddit is serving a different robots.txt file to Google.
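To illustrate what that cloaking means from the crawler's side: the server hands back a different robots.txt body depending on who asks. The two file bodies below are assumptions for the sake of the example, not Reddit's actual files, but they show why Googlebot and everyone else reach opposite conclusions about the same URL.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical cloaked responses: what the server might return to
# Googlebot vs. any other user agent. Illustrative bodies only.
body_served_to_google = ["User-agent: *", "Allow: /"]
body_served_to_others = ["User-agent: *", "Disallow: /"]

def parse(lines):
    rp = RobotFileParser()
    rp.parse(lines)
    return rp

seen_by_google = parse(body_served_to_google)
seen_by_others = parse(body_served_to_others)

# The same page looks open to Google and closed to everyone else.
url = "https://www.reddit.com/r/programming/"
print(seen_by_google.can_fetch("Googlebot", url))     # True
print(seen_by_others.can_fetch("SomeOtherBot", url))  # False
```

You can spot this kind of cloaking yourself by fetching `/robots.txt` twice with different `User-Agent` headers and diffing the responses, which is essentially what the linked merj.com investigation did.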
russjr08@bitforged.space 5 months ago
We believe in the open internet, but we do not believe in the misuse of public content.
That’s real rich, coming from Reddit.
capital@lemmy.world 5 months ago
You’d probably feel differently if it were your service. Should you be able to control who scrapes your sites or should that be all or nothing?
For the record, I fucking hate what the internet is becoming. I naively believed that even if shit got cordoned off into the walled gardens that are mobile phone apps, the web would remain as open as it was. This is a terrible sign of things to come.
reddig33@lemmy.world 5 months ago
No, I wouldn’t feel differently. In fact letting search engines scrape and point to your content is what leads people to your site. It’s free advertising. If you’re going to let one search engine in, you should let them all in.
capital@lemmy.world 5 months ago
It’s not just search engines. Lots of people on Mastodon were using robots.txt to block ChatGPT (and any other LLM company they knew of) from scraping their sites/blogs.
I disagree, to a point. I want to be able to control my services to the greatest extent possible, including picking who scrapes me.
On the other hand, orgs as large as Google doing this poses a real threat to how the internet works right now which I hate.
hotpot8toe@lemmy.world 5 months ago
Also, rate limiting. A publicly accessible website doesn't have to let scrapers read millions of pages each week. Site operators can easily identify and block scrapers by their activity patterns. I don't know whether Reddit rate-limits, but I wouldn't be surprised if it does.
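The classic way a server enforces this is a token bucket: each client earns tokens at a steady rate, each request spends one, and a small burst allowance is tolerated. This is a generic sketch of the technique, not anything known about Reddit's setup; the injectable clock exists only to make the example deterministic.

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: `rate` requests/second,
    bursting up to `capacity` requests."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start with a full bucket
        self.clock = clock
        self.last = clock()

    def allow(self):
        # Refill tokens based on elapsed time, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Fake clock so the demo is deterministic.
t = [0.0]
bucket = TokenBucket(rate=1, capacity=2, clock=lambda: t[0])
print(bucket.allow())  # True  (burst)
print(bucket.allow())  # True  (burst exhausted)
print(bucket.allow())  # False (no tokens left)
t[0] += 1.0
print(bucket.allow())  # True  (one token refilled after 1 second)
```

A scraper pulling millions of pages drains the bucket constantly, so it gets rejected (or served HTTP 429) while normal readers never notice the limiter exists.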