
I was wrong about robots.txt

87 likes

Submitted 1 day ago by KarlHeinzSchwuke@feddit.org to technology@lemmy.world

https://evgeniipendragon.com/posts/i-was-wrong-about-robots-txt/


Comments

  • General_Effort@lemmy.world 1 day ago

    What did he think a crawler is? Why was he surprised that not allowing companies to use his data led to them not using his data? Looks like he has another surprise coming when he notices that search engines no longer index his blog.

    • Archr@lemmy.world 1 day ago

      I feel like most casual users would not connect “crawlers” to the link previews discussed in the article. Sure, if you understand that robots.txt covers all robots, it's obvious. But that is not how general news media has been talking about robots.txt.
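
For illustration, a minimal sketch of that point using Python's standard `urllib.robotparser`. The robots.txt contents and the example.com URL are made up for this sketch; `LinkedInBot` is the preview bot named elsewhere in this thread, and `ExampleAIScraper` is a hypothetical token. A blanket `User-agent: *` rule applies to any bot that honors the file, preview fetchers included, unless a more specific group exempts them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block every crawler from the whole site.
BLANKET_BLOCK = """\
User-agent: *
Disallow: /
"""

# Hypothetical variant: same block, but with an explicit exemption
# for one link-preview bot.
PREVIEW_EXEMPT = """\
User-agent: LinkedInBot
Allow: /

User-agent: *
Disallow: /
"""

URL = "https://example.com/posts/hello/"  # placeholder URL


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("Googlebot", "LinkedInBot", "ExampleAIScraper"):
    print(
        agent,
        "blanket:", allowed(BLANKET_BLOCK, agent, URL),
        "exempt:", allowed(PREVIEW_EXEMPT, agent, URL),
    )
```

This only models what a well-behaved bot is asked to do. Whether a given crawler actually honors the file is a separate question, and one this thread disagrees about.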

      • General_Effort@lemmy.world 1 day ago

        > that is not how general news media has been talking about robots.txt.

        Ahh, yes. I think there is a lesson there.

  • thedruid@lemmy.world 1 day ago

    So, if I can add something here for everyone’s benefit:

    No search engine really obeys robots.txt.

    Their publicly acknowledged crawlers do, but they have other crawlers that aren’t publicly known and that ignore the file.

    Google knows every inch of your site, allowed or not.

    See, just because a search engine says it doesn’t know about a page doesn’t mean it hasn’t crawled it. It just doesn’t display the results, based on your settings.

    • ell1e@leminal.space 1 day ago

      And allowing the public crawler might also have it feed their AI: arstechnica.com/…/cloudflare-wants-google-to-chan…

  • INeedMana@piefed.zip 1 day ago

    Huh. So in this case, the file actually is respected. Refreshing.

    • TeddE@lemmy.world 12 hours ago

      Kinda, but also not really. Any major tech player that has billions to lose will make a show of respecting robots.txt when presenting that information to third parties, lest they be exposed by basic journalism.

      However, they also have separate networks in R&D that sweep the net all the time and do not care about such restrictions. It’s theatre.

      And they’re still happy to punish people that have the gall to publicly decline their crawlers. Basically they can eat their cake and have it too.

    • ell1e@leminal.space 1 day ago

      Often it is, but the problem is that platforms conflate useful crawling with questionable AI scraping, to blackmail websites into participating in feeding AI.

      For example, Googlebot if enabled won’t just list you for search, but will also scrape your contents for Google’s AI. I imagine LinkedinBot, given it’s Microsoft, will feed some other AI of theirs as well on top of the previews.

      Until regulation steps in to require AI bots to separately ask for crawling permission, or to actually get a proper license for reuse of the contents, this situation isn’t going to improve.
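
As a rough sketch of what a separate permission could look like in robots.txt terms, using Python's standard `urllib.robotparser`: the tokens `ExampleSearchBot` and `ExampleAIBot`, the file contents, and the URL are all hypothetical. It only shows that the format can express per-token rules. It helps a site owner only if an operator actually publishes, and honors, a distinct token for its AI crawling.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with one group per crawler token.
SPLIT_RULES = """\
User-agent: ExampleSearchBot
Allow: /

User-agent: ExampleAIBot
Disallow: /
"""

PAGE = "https://example.com/posts/hello/"  # placeholder URL


def may_fetch(robots_txt: str, token: str, url: str) -> bool:
    """Return True if robots_txt permits the crawler token to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, url)


# The search token is allowed, the AI token is refused.
print("ExampleSearchBot:", may_fetch(SPLIT_RULES, "ExampleSearchBot", PAGE))
print("ExampleAIBot:", may_fetch(SPLIT_RULES, "ExampleAIBot", PAGE))

# A token with no matching group and no "*" fallback is allowed by default.
print("UnlistedBot:", may_fetch(SPLIT_RULES, "UnlistedBot", PAGE))
```

If one token covers both search indexing and AI use, no rule in the file can separate the two, which is the situation described above.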

      • General_Effort@lemmy.world 1 day ago

        > Googlebot if enabled won’t just list you for search, but will also scrape your contents for Google’s AI.

        False.
