I was wrong about robots.txt


Submitted 10 months ago by KarlHeinzSchwuke@feddit.org to technology@lemmy.world

https://evgeniipendragon.com/posts/i-was-wrong-about-robots-txt/


Comments


General_Effort@lemmy.world 10 months ago
What did he think a crawler is? Why was he surprised that not allowing companies to use his data led to them not using his data? Looks like he has another surprise coming when he notices that search engines no longer index his blog.
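
To make that concrete, here is a rough sketch using Python's standard `urllib.robotparser`. It only shows what a crawler that honors robots.txt would conclude; it does not enforce anything. The two policies, the bot names chosen for the check, and the `example.com` URL are illustrative assumptions, not the blog's actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy 1: block every robot, no exceptions.
BLANKET = """\
User-agent: *
Disallow: /
"""

# Hypothetical policy 2: block one named crawler, allow everyone else.
SELECTIVE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

PAGE = "https://example.com/posts/hello/"  # placeholder URL
AGENTS = ["Googlebot", "LinkedInBot", "GPTBot"]  # example user-agent names


def allowed_agents(robots_txt: str) -> dict:
    """Return, per agent, whether a compliant crawler may fetch PAGE."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, PAGE) for agent in AGENTS}


blanket_result = allowed_agents(BLANKET)
selective_result = allowed_agents(SELECTIVE)

print("blanket:  ", blanket_result)
print("selective:", selective_result)
```

Under the blanket policy every listed agent is refused, search and link-preview bots included. Under the selective one only the named crawler is refused. Real crawlers use their own parsers, so treat this as an approximation of compliant behavior.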

- Archr@lemmy.world 10 months ago
  I feel like most casual users would not connect “crawlers” to the link previews that the article talks about. Sure, if you understand that robots.txt covers all robots, it is obvious. But that is not how general news media has been talking about robots.txt.
  
  - General_Effort@lemmy.world 10 months ago
    
    > that is not how general news media has been talking about robots.txt.
    
    Ahh, yes. I think there is a lesson there.
    
thedruid@lemmy.world 10 months ago
So, if I can add something here for everyone’s benefit:

No search engine really obeys robots.txt.

Their publicly acknowledged crawlers do, but they have other crawlers that aren’t publicly known and that ignore the file.

Google knows every inch of your site, allowed or not.

See, just because a search engine says it doesn’t know about a page doesn’t mean it hasn’t crawled it. It just doesn’t display the results, based on your settings.

- ell1e@leminal.space 10 months ago
  And allowing the public crawler might also have it feed their AI: arstechnica.com/…/cloudflare-wants-google-to-chan…
  
INeedMana@piefed.zip 10 months ago
Huh. So in this case, the file actually is respected. Refreshing.

- TeddE@lemmy.world 10 months ago
  Kinda, but also not really. Any major tech player that has billions to lose will make a show of respecting robots.txt when presenting that information to third parties, lest they be exposed by basic journalism.
  
  However, they also have separate networks in R&D that sweep the net all the time and do not care about such restrictions. It’s theatre.
  
  And they’re still happy to punish people that have the gall to publicly decline their crawlers. Basically they can eat their cake and have it too.
  
- ell1e@leminal.space 10 months ago
  Often it is, but the problem is that platforms conflate useful crawling with their questionable AI-scraping crawlers, blackmailing websites into feeding AI.
  
  For example, Googlebot if enabled won’t just list you for search, but will also scrape your contents for Google’s AI. I imagine LinkedInBot, given it’s Microsoft, will feed some other AI of theirs as well, on top of the previews.
  
  Until regulation steps in to require AI bots to separately ask for crawling permission, or to actually get a proper license for reuse of the contents, this situation isn’t going to improve.
  
  - General_Effort@lemmy.world 10 months ago
    
    > Googlebot if enabled won’t just list you for search, but will also scrape your contents for Google’s AI.
    
    False.
    