Wikipedia is giving AI developers its data to fend off bot scrapers

99 likes

Submitted 4 weeks ago by Tea@programming.dev to technology@lemmy.world

https://enterprise.wikimedia.com/blog/kaggle-dataset/

Comments

  • MCasq_qsaCJ_234@lemmy.zip 4 weeks ago

    I just feel like OpenAI might accept this and ignore the website, although it’s very unlikely they will actually do that.

  • Geodad@lemm.ee 4 weeks ago

    Is there not some way to just blacklist the AI domain or IP range?

    • Monument@lemmy.sdf.org 4 weeks ago

      No, because there isn't a single IP range or user agent. Many developers go to great lengths to defeat anti-scraping measures, including user agent spoofing as well as VPNs and the like to mask the source of the traffic.
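
      To make that concrete, here's a minimal sketch of the naive deny-list filter people usually reach for (stdlib Python; the ranges and agent strings are placeholders, not any real blocklist). It only catches crawlers that announce themselves honestly; a spoofed browser user agent from a fresh VPN or residential address sails straight through:

      ```python
      import ipaddress

      # Hypothetical examples only -- there is no authoritative list of scraper ranges.
      BLOCKED_RANGES = [ipaddress.ip_network("203.0.113.0/24")]   # documentation range, placeholder
      BLOCKED_AGENT_SUBSTRINGS = ["GPTBot", "CCBot"]              # only matches self-declared bots

      def is_blocked(client_ip: str, user_agent: str) -> bool:
          """Naive deny-list check: known ranges plus honest user-agent strings."""
          ip = ipaddress.ip_address(client_ip)
          if any(ip in net for net in BLOCKED_RANGES):
              return True
          return any(s.lower() in user_agent.lower() for s in BLOCKED_AGENT_SUBSTRINGS)

      # A polite crawler that identifies itself gets caught...
      print(is_blocked("203.0.113.7", "GPTBot/1.0"))              # True
      # ...but a spoofed browser UA from an unlisted address is indistinguishable from a reader.
      print(is_blocked("198.51.100.9", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))  # False
      ```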

    • devfuuu@lemmy.world [bot] 4 weeks ago

      If you read the few articles from recent months about sites being hammered by AI scrapers, they all tell the same story: it's not possible. The AI companies deliberately target other sites and work non-stop to actively evade any kind of blocking that might be in place. They rotate IPs regularly, they change user agents, they ignore robots.txt, they spread requests across a bunch of IPs, they drop to a single request per IP once they detect they're being blocked, they swap user agents the moment one gets blocked, etc etc etc.
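
      To put a number on why that works: the standard defence is a per-IP sliding-window rate limiter, sketched below with made-up thresholds. It shuts down one noisy address almost immediately, but the same request volume rotated across a hundred IPs stays under every per-IP counter and is never blocked:

      ```python
      import time
      from collections import defaultdict, deque

      WINDOW_SECONDS = 60
      MAX_REQUESTS_PER_IP = 30          # made-up threshold for illustration
      _history: dict[str, deque] = defaultdict(deque)

      def allow_request(client_ip: str, now: float | None = None) -> bool:
          """Classic per-IP sliding window: allow unless this IP exceeded the limit."""
          now = time.monotonic() if now is None else now
          window = _history[client_ip]
          while window and now - window[0] > WINDOW_SECONDS:
              window.popleft()          # drop requests older than the window
          if len(window) >= MAX_REQUESTS_PER_IP:
              return False
          window.append(now)
          return True

      # 3,000 requests in one minute from a single IP: everything past the first 30 is refused.
      blocked = sum(not allow_request("203.0.113.7", now=t * 0.02) for t in range(3000))
      # The same 3,000 requests rotated across 100 IPs: 30 each, nothing trips the limit.
      allowed = sum(allow_request(f"10.0.0.{t % 100}", now=t * 0.02) for t in range(3000))
      print(blocked, allowed)           # 2970 3000
      ```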

      • baines@lemmy.cafe 4 weeks ago

        whitelists and the end of anonymity
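
        In other words, a default-deny gate: nothing gets through without a credential tied to a verified account. A minimal sketch of the idea, using a hypothetical HMAC token scheme rather than anything any real site actually runs:

        ```python
        import hashlib
        import hmac
        import secrets

        SERVER_SECRET = secrets.token_bytes(32)     # issued and kept server-side

        def issue_token(account_id: str) -> str:
            """Hand a signed token to a verified, non-anonymous account."""
            sig = hmac.new(SERVER_SECRET, account_id.encode(), hashlib.sha256).hexdigest()
            return f"{account_id}:{sig}"

        def allow_request(token: str | None) -> bool:
            """Default deny: no valid token, no access -- anonymity is the casualty."""
            if not token or ":" not in token:
                return False
            account_id, sig = token.rsplit(":", 1)
            expected = hmac.new(SERVER_SECRET, account_id.encode(), hashlib.sha256).hexdigest()
            return hmac.compare_digest(sig, expected)

        print(allow_request(issue_token("verified-user-42")))   # True: known account
        print(allow_request(None))                              # False: anonymous reader
        print(allow_request("scraper:forged-signature"))        # False: can't forge the HMAC
        ```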

    • MangoPenguin@lemmy.blahaj.zone 4 weeks ago

      Nope, there’s no specific range of IPs that AI scrapers use.
