Comment on I was wrong about robots.txt

INeedMana@piefed.zip ⁨10⁩ ⁨months⁩ ago

Huh. So in this case, the file actually is respected. Refreshing.


TeddE@lemmy.world ⁨10⁩ ⁨months⁩ ago
Kinda, but also not really. Any major tech player that has billions to lose will make a show of respecting robots.txt when presenting that information to third parties, lest they be exposed by basic journalism.

However, they also have separate networks in R&D that sweep the net all the time and do not care about such restrictions. It’s theatre.

And they’re still happy to punish people that have the gall to publicly decline their crawlers. Basically they can eat their cake and have it too.

ell1e@leminal.space ⁨10⁩ ⁨months⁩ ago
Often it is, but the problem is that platforms bundle their legitimate crawlers with questionable AI scraping, which effectively blackmails websites into feeding AI.

For example, Googlebot, if enabled, won’t just list you for search, but will also scrape your contents for Google’s AI. I imagine LinkedinBot, given it’s Microsoft, will feed some other AI of theirs as well, on top of the previews.

Until regulation steps in to require AI bots to separately ask for crawling permission, or to actually get a proper license for reuse of the contents, this situation isn’t going to improve.
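
To illustrate what "separately ask for crawling permission" could look like in practice: robots.txt rules are keyed by user agent, so a site can address a search crawler and an AI crawler separately, provided the operator actually publishes distinct tokens and honors them. Below is a minimal sketch using Python's standard `urllib.robotparser`. The bot names `ExampleSearchBot` and `ExampleAIBot` and the `example.com` URL are hypothetical placeholders, not real crawler tokens.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: allow a search crawler, refuse an AI crawler.
# Both bot names are made up for illustration.
ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Allow: /

User-agent: ExampleAIBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, url: str = "https://example.com/post") -> bool:
    """Return whether the parsed rules permit `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for bot in ("ExampleSearchBot", "ExampleAIBot"):
        print(bot, "allowed" if allowed(bot) else "disallowed")
```

This only shows what the file expresses. Nothing in it forces a crawler to comply, and it does not help if one token covers both search and AI use.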

- General_Effort@lemmy.world ⁨10⁩ ⁨months⁩ ago
  
  Googlebot if enabled won’t just list you for search, but will also scrape your contents for Google’s AI.
  
  False.
  
  - ell1e@leminal.space ⁨10⁩ ⁨months⁩ ago
    arstechnica.com/…/cloudflare-wants-google-to-chan…
    
    General_Effort@lemmy.world ⁨10⁩ ⁨months⁩ ago
    Ok. That quotes a tweet by Cloudflare’s CEO. IDK what his qualifications are, but his conflict of interest is obvious enough. Real quality journalism there.
    
    Here’s Google’s technical documentation on its crawlers: developers.google.com/…/google-common-crawlers
    
  - cecilkorik@lemmy.ca ⁨10⁩ ⁨months⁩ ago
    Absolutely true. They’ll buy the data they want from some shitty crawler running from some data broker in some far-flung and lawless part of the world, hallucinate the actual source, and pretend they had no idea their “data partner” wasn’t respecting robots.txt if they have to, which they won’t ever have to do because it’s literally impossible to detect and prove and realistically unenforceable.
    
    This is a company that removed its company motto of “Don’t be evil” because it found it too “limiting”. Don’t be naive.
    
    General_Effort@lemmy.world ⁨10⁩ ⁨months⁩ ago
    That’s very different from what I called false.
    
    What you describe may happen, but probably not as much as you think. Much of that stuff is just not that valuable. Some personal, colloquial writing is necessary, but Google already pays Reddit. Other stuff is better obtained from torrents or shadow libraries like Anna’s Archive.
    