Open Menu
AllLocalCommunitiesAbout
lotide
AllLocalCommunitiesAbout
Login

Codeberg: army of AI crawlers are extremely slowing us; AI crawlers learned how to solve the Anubis challenges.

⁨0⁩ ⁨likes⁩

Submitted ⁨⁨7⁩ ⁨months⁩ ago⁩ by ⁨Pro@programming.dev⁩ to ⁨technology@lemmy.world⁩

https://i.imgur.com/0sDPk8a.png

cross-posted from: programming.dev/post/35852706

Image

Image

Image

Source.

source

Comments

Sort:hotnewtop
  • bizza@lemmy.zip ⁨7⁩ ⁨months⁩ ago

    I use Anubis on my personal website, not because I think anything I’ve written is important enough that companies would want to scrape it, but as a “fuck you” to those companies regardless

    That the bots are learning to get around it is disheartening, Anubis was a pain to setup and get running

    source
  • nialv7@lemmy.world ⁨7⁩ ⁨months⁩ ago

    We had a trust based system for so long. No one is forced to honor robots.txt, but most big players did. Almost restores my faith in humanity a little bit. And then AI companies came and destroyed everything. This is we can’t have nice things.

    source
    • Shapillon@lemmy.world ⁨7⁩ ⁨months⁩ ago

      Big players are the ones behind most AIs though.

      source
  • cupcakezealot@piefed.blahaj.zone ⁨7⁩ ⁨months⁩ ago

    reminder to donate to codeberg and forgejo :)

    source
  • interdimensionalmeme@lemmy.ml ⁨7⁩ ⁨months⁩ ago

    Just provide a full dump.zip plus incremental daily dumps and they won’t have to scrape ? Isn’t that an obvious solution ? I mean, it’s public data, it’s out there, do you want it public or not ? Do you want it only on openai and google but nowhere else ? If so then good luck with the piranhas

    source
    • 0x0@lemmy.zip ⁨7⁩ ⁨months⁩ ago

      they won’t have to scrape ?

      They don’t have to scrape; especially if robots.txt tells them not to.

      it’s public data, it’s out there, do you want it public or not ?

      Hey, she was wearing a miniskirt, she wanted it, right?

      source
      • interdimensionalmeme@lemmy.ml ⁨7⁩ ⁨months⁩ ago

        No no no, you don’t get to invoke grape imagery to defend copyright.

        I know, it hurts when the human shields like wikipedia and the openwrt forums are getting hit, especially when they hand over the goods in dumps. But behind those human shields stand facebook, xitter, amazon, reddit and the rest of big tech garbage and I want tanks to run through them.

        So go back to your drawing board and find a solution the tech platform monopolist are made to relinquish our data back to use and the human shields also survive.

        My own mother is prisoner in the Zuckerberg data hive and the only way she can get out is brute zucking force into facebook’s poop chute.

        source
        • -> View More Comments
    • qaz@lemmy.world ⁨7⁩ ⁨months⁩ ago

      I think the issue is that the scrapers are fully automatically collecting text, jumping from link to link like a search engine indexer.

      source
    • dwzap@lemmy.world ⁨7⁩ ⁨months⁩ ago

      The Wikimedia Foundation does just that, and still, their infrastructure is under stress because of AI scrapers.

      Dumps or no dumps, these AI companies don’t care. They feel like they’re entitled to taking or stealing what they want.

      source
      • interdimensionalmeme@lemmy.ml ⁨7⁩ ⁨months⁩ ago

        That’s crazy, it makes no sense, it takes as much bandwidth and processing power on the scraper side to process and use the data as it takes to serve it.

        They also have an open API that makes scraper entirely unnecessary too.

        Here are the relevant quotes from the article you posted

        “Scraping has become so prominent that our outgoing bandwidth has increased by 50% in 2024.”

        “At least 65% of our most expensive requests (the ones that we can’t serve from our caching servers and which are served from the main databases instead) are performed by bots.”

        “Over the past year, we saw a significant increase in the amount of scraper traffic, and also of related site-stability incidents: Site Reliability Engineers have had to enforce on a case-by-case basis rate limiting or banning of crawlers repeatedly to protect our infrastructure.”

        And it’s wikipedia ! The entire data set is trained INTO the models already, it’s not like encyclopedic facts change that often to begin with !

        The only thing I imagine is that it is part of a larger ecosystem issue, there the rare case where a dump and API access is so rare, and so untrust worthy that the scrapers are just using scrape for everything, rather than taking the time to save bandwidth by relying on dumps.

        Maybe it’s consequences from the 2023 API wars, where it was made clear that data repositories would be leveraging their place as pool of knowledge to extract rent from search and AI and places like wikipedia and other wikis and forums are getting hammered as a result of this war.

        source
    • umbraroze@slrpnk.net ⁨7⁩ ⁨months⁩ ago

      The problem isn’t that the data is already public.

      The problem is that the AI crawlers want to check on it every 5 minutes, even if you try to tell all crawlers that the file is updated daily, or that the file hasn’t been updated in a month.

      AI crawlers don’t care about robots.txt or other helpful hints about what’s worth crawling or not, and hints on when it’s good time to crawl again.

      source
      • interdimensionalmeme@lemmy.ml ⁨7⁩ ⁨months⁩ ago

        Yeah but there’s would be scrappers if the robots file just pointed to a dump file.

        Then the scraper could just do a spot check a few dozen random page and check the dump is actually up to date and complete and then they’d know they don’t need to waste any time there and move on.

        source
        • -> View More Comments
  • mfed1122@discuss.tchncs.de ⁨7⁩ ⁨months⁩ ago

    Okay what about…what about uhhh… Static site builders that render the whole page out as an image map, making it visible for humans but useless forccrawlers 🤔🤔🤔

    source
    • iopq@lemmy.world ⁨7⁩ ⁨months⁩ ago

      AI these days reads text from images better than humans can

      source
    • NeilBru@lemmy.world ⁨7⁩ ⁨months⁩ ago

      Computer vision models can read/parse pixel geometry.

      source
    • echodot@feddit.uk ⁨7⁩ ⁨months⁩ ago

      AI is pretty good at OCR now. I think that would just make it worse for humans while making very little difference to the AI.

      source
      • mfed1122@discuss.tchncs.de ⁨7⁩ ⁨months⁩ ago

        The crawlers are likely not AI though, but yes OCR could be done effectively without AI anyways. This idea ultimately boils down to the same hope Anubis had of making the processing costs large enough to not be worth it.

        source
        • -> View More Comments
    • Baleine@jlai.lu ⁨7⁩ ⁨months⁩ ago

      Humans that don’t see:

      source
    • lapping6596@lemmy.world ⁨7⁩ ⁨months⁩ ago

      Accessibility gets throw out the window?

      source
      • mfed1122@discuss.tchncs.de ⁨7⁩ ⁨months⁩ ago

        I wasn’t being totally serious, but also, I do think that while accessibility concerns come from a good place, there is some practical limitation that must be accepted when building fringe and counter-cultural things. Like, my hidden rebel base can’t have a wheelchair accessible ramp at the entrance, because then my base isn’t hidden anymore. It sucks that some solutions can’t work for everyone, but if we just throw them out because it won’t work for 5% of people, we end up with nothing. I’d rather have a solution that works for 95% of people than no solution at all. I’m not saying that people who use screen readers are second-class citizens. If crawlers were vision-based then I might suggest matching text to background colors so that only screen readers work to understand the site. Because something that works for 5% of people is also better than no solution at all. We need to tolerate having imperfect first attempts and understand that more sophisticated infrastructure comes later.

        But yes my image map idea is pretty much a joke nonetheless

        source
        • -> View More Comments
  • prole@lemmy.blahaj.zone ⁨7⁩ ⁨months⁩ ago

    Tech bros just actively making the internet worse for everyone.

    source
    • iopq@lemmy.world ⁨7⁩ ⁨months⁩ ago

      I mean, tech bros of the past invented the internet

      source
      • CeeBee_Eh@lemmy.world ⁨7⁩ ⁨months⁩ ago

        Those were tech nerds. “Tech bros” are jabronis who see the tech sector as a way to increase the value of the money their daddies gave them.

        source
      • prole@lemmy.blahaj.zone ⁨7⁩ ⁨months⁩ ago

        Nah, that was DAARPA

        source
      • notarobot@lemmy.zip ⁨7⁩ ⁨months⁩ ago

        Those are not the tech bros. The tech bros are the ones who move fast and break things. The internet was built by engineers and developers

        source
    • ShaggySnacks@lemmy.myserv.one ⁨7⁩ ⁨months⁩ ago

      Tech bros just actively making the internet society worse for everyone.

      FTFY.

      source
  • Monument@lemmy.sdf.org ⁨7⁩ ⁨months⁩ ago

    Increasingly, I’m reminded of this: Paul Bunyan vs. the spam bot (or how Paul Bunyan triggered the singularity to win a bet). It’s a medium-length read from the old internet, but fun.

    source
  • thatonecoder@lemmy.ca ⁨7⁩ ⁨months⁩ ago

    I know this is the most ridiculous idea, but we need to pack our bags and make a new internet protocol, to separate us from the rest, at least for a while. Either way, most “modern” internet things (looking at you, JavaScript) are not modern at all, and starting over might help more than any of us could imagine.

    source
    • Pro@programming.dev ⁨7⁩ ⁨months⁩ ago

      Like Gemini?

      Gemini is a new internet technology supporting an electronic library of interconnected text documents. That’s not a new idea, but it’s not old fashioned either. It’s timeless, and deserves tools which treat it as a first class concept, not a vestigial corner case. Gemini isn’t about innovation or disruption, it’s about providing some respite for those who feel the internet has been disrupted enough already. We’re not out to change the world or destroy other technologies. We are out to build a lightweight online space where documents are just documents, in the interests of every reader’s privacy, attention and bandwidth.

      source
      • 0x0@lemmy.zip ⁨7⁩ ⁨months⁩ ago

        It’s not the most well thought-out, from a technical perspective, but it’s pretty damn cool. Gemini pods are a freakin’ rabbi hole.

        source
      • vacuumflower@lemmy.sdf.org ⁨7⁩ ⁨months⁩ ago

        I’ve personally played with Gemini a few months ago, and now want a new Internet as opposed to a new Web.

        Replace IP protocols with something better. With some kind of relative addressing, and delay-tolerant synchronization being preferred to real-time connections between two computers. So that there were no permanent global addresses at all, and no centralized DNS.

        With the main “Web” over that being just replicated posts with tags hyperlinked by IDs, with IDs determined by content. Structured, like semantic web, so that a program could easily use such a post as directory of other posts or a source of text or retrieve binary content.

        With user identities being a kind of post content, and post authorship being too a kind of post content or maybe tag content, cryptographically signed.

        Except that would require to resolve post dependencies and retrieve them too with some depth limit, not just the post one currently opens, because, if it’d be like with bittorrent, half the hyperlinks in found posts would soon become dead, and also user identities would possibly soon become dead, making authorship check impossible.

        And posts (suppose even sites of that flatweb) being found by tags, maybe by author tag, maybe by some “channel” tag, maybe by “name” tag, one can imagine plenty of things.

        The main thing is to replace “clients connecting to a service” with “persons operating on messages replicated on the network”, with networked computers sharing data like echo or ripples on the water. In what would be the general application layer for such a system.

        OK, this is very complex to do and probably stupid.

        It’s also not exactly the same level as IP protocols, so this can work over the Internet, just like the Internet worked just fine, for some people, over packet radio and UUCP or FTN email gates and copper landlines. Just for the Internet to be the main layer in terms of which we find services, on the IP protocols, TCP, UDP, ICMP, all that, and various ones and DNS on application layer, - that I consider wrong, it’s too hierarchical. So it’s not a “replacement”.

        source
        • -> View More Comments
      • cwista@lemmy.world ⁨7⁩ ⁨months⁩ ago

        Won’t the bots just adapt and move there too?

        source
      • thatonecoder@lemmy.ca ⁨7⁩ ⁨months⁩ ago

        Yep! That was exactly the protocol on my mind. One thing, though, is that the Fediverse would need to be ported to Gemini, or at least for a new protocol to be created for Gemini.

        source
        • -> View More Comments
  • zbyte64@awful.systems ⁨7⁩ ⁨months⁩ ago

    Is there nightshade but for text and code? Maybe my source headers should include a bunch of special characters that then give a prompt injection. And sprinkle some nonsensical code comments before the real code comment.

    source
    • qaz@lemmy.world ⁨7⁩ ⁨months⁩ ago

      There are glitch tokens but I think those only effect it when using it.

      source
    • Honytawk@feddit.nl ⁨7⁩ ⁨months⁩ ago

      Maybe like a bunch of white text at 2pt?

      Not visible to the user, but fully readable by crawlers.

      source
      • ramjambamalam@lemmy.ca ⁨7⁩ ⁨months⁩ ago

        If a bot can’t read it, nor can a visually impaired user.

        source
        • -> View More Comments
    • kuberoot@discuss.tchncs.de ⁨7⁩ ⁨months⁩ ago

      I think the issue is that text uses comparatively very little information, so you can’t just inject invisible changes by changing the least insignificant bits - you’d need to change the actual phrasing/spelling of your text/code, and that’d be noticable.

      source
  • StopSpazzing@lemmy.world ⁨7⁩ ⁨months⁩ ago

    Is there a migration tool? If not would be awesome to migrate everything includong issues and stuff. Bet even more people woild move.

    source
    • BlameTheAntifa@lemmy.world ⁨7⁩ ⁨months⁩ ago

      Codeberg has very good migration tools built in. You need to do one repo at a time, but it can move issues, releases, and everything.

      source
    • dodos@lemmy.world ⁨7⁩ ⁨months⁩ ago

      There are migration tools, but not a good bulk one that I could find. It worked for my repos except for my unreal engine fork.

      source
  • Wispy2891@lemmy.world ⁨7⁩ ⁨months⁩ ago

    Question: those artificial stupidity bots want to steal the issues or want to steal the code? Because why they’re wasting a lot of resources scraping millions of pages when they can steal everything via SSH (once a month, not 120 times a second)

    source
    • lime@feddit.nu ⁨7⁩ ⁨months⁩ ago

      they just want all text

      source
    • Passerby6497@lemmy.world ⁨7⁩ ⁨months⁩ ago

      That would require having someone with real intelligence running the scraper.

      source
  • londos@lemmy.world ⁨7⁩ ⁨months⁩ ago

    Can there be a challenge that actually does some maliciously useful compute? Like make their crawlers mine bitcoin or something.

    source
    • 0x0@lemmy.zip ⁨7⁩ ⁨months⁩ ago

      Anubis does that. You may’ve seen it already.

      source
    • nymnympseudonym@lemmy.world ⁨7⁩ ⁨months⁩ ago

      The Monero community spent a long time trying to find a “useful PoW” function. The problem is that most computations that are useful are not also easy to verify as correct. javascript optimization was one direction that got pursued pretty far.

      But at the end of the day, a crypto that actually intends to withstand attacks from major governments requires a system that is decentralized, trustless, and verifiable, and the only solutions that have been found to date involve algorithms for which a GPU or even custom ASIC confers no significant advantage over a consumer-grade CPU.

      source
    • T156@lemmy.world ⁨7⁩ ⁨months⁩ ago

      Not without making real users also mine bitcoin/avoiding the site because their performance tanked.

      source
    • raspberriesareyummy@lemmy.world ⁨7⁩ ⁨months⁩ ago

      Did you just say use the words “useful” and “bitcoin” in the same sentence? o_O

      source
      • polle@feddit.org ⁨7⁩ ⁨months⁩ ago

        The saddest part is, we thought crypto was the biggest waste of energy ever and then the LLMs entered the chat.

        source
        • -> View More Comments
      • londos@lemmy.world ⁨7⁩ ⁨months⁩ ago

        I went back and added “malicious” because I knew it wasn’t useful in reality. I just wanted to express the AI crawlers doing free work. But you’re right, bitcoin sucks.

        source
        • -> View More Comments
      • kameecoding@lemmy.world ⁨7⁩ ⁨months⁩ ago

        Bro couldn’t even bring himself to mention protein folding because that’s too socialist I guess.

        source
        • -> View More Comments
  • e8CArkcAuLE@piefed.social ⁨7⁩ ⁨months⁩ ago

    Image

    how this felt like while reading

    source
  • oeuf@slrpnk.net ⁨7⁩ ⁨months⁩ ago

    Crazy. DDoS attacks are illegal here in the UK.

    source
    • BlameTheAntifa@lemmy.world ⁨7⁩ ⁨months⁩ ago

      The problem is that hundreds of bad actors doing the same thing independently of one another means it does not qualify as a DDoS attack. Maybe it’s time we start legally restricting bots and crawlers, though.

      source
    • rdri@lemmy.world ⁨7⁩ ⁨months⁩ ago

      So, sue the attackers?

      source
  • Goretantath@lemmy.world ⁨7⁩ ⁨months⁩ ago

    I knew that was the worse option. Use the one that traps them in an infinite maze.

    source
    • aquovie@lemmy.cafe ⁨7⁩ ⁨months⁩ ago

      You need to properly detect that they’re bots first and then they’ll just figure out how to spoof that. Then you’re back to square one.

      Abstractly, POW doesn’t need to determine if you’re a bot or not. To make a request, as a human or bot, you need to pay in cpu-time. The hope is that the cost is not so high that a human notices very much but for a bot trying to hoover up data as fast as possible, the aggregate cost is high.

      I think the more horrifying aspect is that they’ll just build ever bigger datacenters to crunch POW tests faster and the carbon cost will skyrocket even more.

      source
      • Auth@lemmy.world ⁨7⁩ ⁨months⁩ ago

        Trap users in the maze as well :)

        source
      • nialv7@lemmy.world ⁨7⁩ ⁨months⁩ ago

        Oh I haven’t even considered the carbon aspect. Anubis is an even worse idea than I previously thought…

        source
      • mic_check_one_two@lemmy.dbzer0.com ⁨7⁩ ⁨months⁩ ago

        Exactly. Imagine needing to pay a penny for every request. Not a huge deal for someone who only makes one or two requests per year. But if you’re running a bot farm and making tens of millions of requests per day, you’ll quickly find that your operating costs have skyrocketed. That’s basically the idea behind Anubis; Make someone pay in CPU time, so the legit users don’t really notice but bots quickly eat up all of their servers’ CPU.

        source
  • carrylex@lemmy.world ⁨7⁩ ⁨months⁩ ago

    And once again a WebApplicationFirewall(WAF) was defeated and it turns out that blocklists and bot detection tools like fail2ban are the way to go…

    source
  • wetbeardhairs@lemmy.dbzer0.com ⁨7⁩ ⁨months⁩ ago

    Gosh. Corporations are rampantly attempting to access resources so they can perform copyright infringement en-masse. I wonder if there is a legal mechanism to stop them? Oh, no there isn’t because our government is fully corrupted.

    source
    • aquovie@lemmy.cafe ⁨7⁩ ⁨months⁩ ago

      I think, in this particular case, it’s aggressive apathy/incompetence and not malice. Remember, Trump didn’t even know what Nvidia was.

      AI’s don’t have a skin color or use the bathroom so you can’t whip your cult into a frenzy by Othering it. You can’t solidify your fascism by getting bogged down in the details of IP law.

      source
      • 0x0@lemmy.zip ⁨7⁩ ⁨months⁩ ago

        Trump didn’t even know what Nvidia was.

        I think you mean navidia.

        source
      • Corkyskog@sh.itjust.works ⁨7⁩ ⁨months⁩ ago

        Just say that the AI will be used to train the immigrants to take der jerbs.

        source
        • -> View More Comments
  • zifk@sh.itjust.works ⁨7⁩ ⁨months⁩ ago

    Anubis isn’t supposed to be hard to avoid, but expensive to avoid. Not really surprised that a big company might be willing to throw a bunch of cash at it.

    source
    • randomblock1@lemmy.world ⁨7⁩ ⁨months⁩ ago

      No, it’s expensive to comply (at a massive scale), but easy to avoid. Just change the user agent. There’s even a dedicated extension for bypassing Anubis.

      Even then AI servers have plenty of compute, it realistically doesn’t cost much. Maybe like a thousandth of a cent per solve? They’re spending billions on GPU power, they don’t care.

      I’ve been saying this since day 1 of Anubis but nobody wants to hear it.

      source
      • T156@lemmy.world ⁨7⁩ ⁨months⁩ ago

        The website would also have to display to users at the end of the day. It’s a similar problem as trying to solve media piracy. Worst comes to it, the crawlers could read the page like a person would.

        source
        • -> View More Comments
    • sudo@programming.dev ⁨7⁩ ⁨months⁩ ago

      This is what I’ve kept saying about POW being a shit bot management tactic. Its a flat tax across all users, real or fake. The fake users are getting making money to access your site and will just eat the added expense. You can raise the tax to cost more than what your data is worth to them, but that also affects your real users. Nothing about Anubis even attempts to differentiate between bots and real users.

      If the bots take the time, they can set up a pipeline to solve Anubis tokens outside of the browser more efficiently than real users.

      source
      • OpenPassageways@lemmy.zip ⁨7⁩ ⁨months⁩ ago

        What the alternative?

        source
        • -> View More Comments
      • black_flag@lemmy.dbzer0.com ⁨7⁩ ⁨months⁩ ago

        Yeah but ai companies are losing money so in the long run Anubis seems like it should eventually return to working.

        source
        • -> View More Comments
  • sailorzoop@lemmy.librebun.com ⁨7⁩ ⁨months⁩ ago

    I’m ashamed to say that I switched my DNS nameservers to CF just for their anti crawler service.
    Knowing Cloudflare, god know how much longer it’ll be free for.

    source
    • AmbiguousProps@lemmy.today ⁨7⁩ ⁨months⁩ ago

      Did you enable the AI black hole/tarpit? It’s the main reason I’ve used their stuff.

      source
      • sailorzoop@lemmy.librebun.com ⁨7⁩ ⁨months⁩ ago

        TIL! Just enabled it, thanks

        source
  • SufferingSteve@feddit.nu ⁨7⁩ ⁨months⁩ ago

    There once was a dream of the semantic web, also known as web2. The semantic web could have enabled easy to ingest information of webpages, removing soo much of the computation required to get the information. Thus preventing much of the air crawling cpu overhead.

    What we got as web2 instead was social media. Destroying facts and making people depressed at a newer before seen rate.

    Web3 was about enabling us to securely transfer value between people digitally and without middlemen.

    What crypto gave us was fraud, expensive jpgs and scams. The term web is now even so eroded that it has lost much of its meaning. The information age gave way for the misinformation age, where everything is fake.

    source
  • xxce2AAb@feddit.dk ⁨7⁩ ⁨months⁩ ago

    If this isn’t fertile grounds for a massive class-action lawsuit, I don’t know what would be.

    source
  • Kyrgizion@lemmy.world ⁨7⁩ ⁨months⁩ ago

    Eventually we’ll have “defensive” and “offensive” llm’s managing all kinds of electronic warfare automatically, effectively nullifying each other.

    source
  • UnderpantsWeevil@lemmy.world ⁨7⁩ ⁨months⁩ ago

    I mean, we really have to ask ourselves - as a civilization - whether human collaboration is more important than AI data harvesting.

    source
  • PhilipTheBucket@piefed.social ⁨7⁩ ⁨months⁩ ago

    I feel like at some point it needs to be active response. Phase 1 is a teergrube type of slowness to muck up the crawlers, with warnings in the headers and response body, and then phase 2 is a DDOS in response or maybe just a drone strike and cut out the middleman. Once you've actively evading Anubis, fuckin' game on.

    source
  • 0x0@lemmy.zip ⁨7⁩ ⁨months⁩ ago

    It’s always a cat-n-mouse game.

    source