Codeberg: army of AI crawlers are extremely slowing us; AI crawlers learned how to solve the Anubis challenges.

⁨1162⁩ ⁨likes⁩

Submitted ⁨⁨3⁩ ⁨months⁩ ago⁩ by ⁨Pro@programming.dev⁩ to ⁨technology@lemmy.world⁩

https://i.imgur.com/0sDPk8a.png

cross-posted from: programming.dev/post/35852706

Image

Image

Image

Source.

source

Comments

Sort:hotnew top

SufferingSteve@feddit.nu ⁨3⁩ ⁨months⁩ ago
There once was a dream of the semantic web, also known as web2. The semantic web could have enabled easy to ingest information of webpages, removing soo much of the computation required to get the information. Thus preventing much of the air crawling cpu overhead.

What we got as web2 instead was social media. Destroying facts and making people depressed at a newer before seen rate.

Web3 was about enabling us to securely transfer value between people digitally and without middlemen.

What crypto gave us was fraud, expensive jpgs and scams. The term web is now even so eroded that it has lost much of its meaning. The information age gave way for the misinformation age, where everything is fake.

source
- Marshezezz@lemmy.blahaj.zone ⁨3⁩ ⁨months⁩ ago
  Capitalism is grand, innit. Wait, not grand, I meant to say cancer
  
  source
  - Serinus@lemmy.world ⁨3⁩ ⁨months⁩ ago
    I feel like half of the blame capitalism gets is valid, but the other half is just society. I don’t care what kind of system you’re under, you’re going to have to deal with other people.
    
    Oh, and if you try the system where you don’t have to deal with people, that just means other people end up handling you.
    
    source
    -> View More Comments
- tourist@lemmy.world ⁨3⁩ ⁨months⁩ ago
  
  Web3 was about enabling us to securely transfer value between people digitally and without middlemen.
  
  It’s ironic that the middlemen showed up anyway and busted all the security of those transfers
  
  You want some bipcoin to buy weed drugs on the slip road? Don’t bother figuring out how to set up that wallet shit, come to our nifty token exchange where you can buy and sell all kinds of bipcoins
  
  oh btw every government on the planet showed up and dug through our insecure records. hope you weren’t actually buying shroom drugs on the slip rod
  
  also we got hacked, you lost all your bipcoins sorry
  
  At least, that’s my recollection of events. I was getting my illegal narcotics the old fashioned way.
  
  source
  - Chee_Koala@lemmy.world ⁨3⁩ ⁨months⁩ ago
    
    the old fashioned way.
    
    A whole swath of trained toads using a special made tube network?
    
    source
    -> View More Comments
  - raspberriesareyummy@lemmy.world ⁨3⁩ ⁨months⁩ ago
    
    also we got hacked, you lost all your bipcoins sorry
    
    aaaaaaaaand - it’s gone!
    
    source
  - prole@lemmy.blahaj.zone ⁨3⁩ ⁨months⁩ ago
    
    You want some bipcoin to buy weed drugs on the slip road? Don’t bother figuring out how to set up that wallet shit, come to our nifty token exchange where you can buy and sell all kinds of bipcoins
    
    Maybe I’m slow today, but what is this referencing? Most dark web sites use Monero. Is there some centralized token that people use instead?
    
    source
    -> View More Comments
- muusemuuse@sh.itjust.works ⁨3⁩ ⁨months⁩ ago
  Sound like it went the same way everything else went. The less money is involved the more trustworthy it is.
  
  source
- GreenShimada@lemmy.world ⁨3⁩ ⁨months⁩ ago
  Mr. Internet, tear down these walls! (for all these walled gardens)
  
  Return the internet to the wild. Let it run feral like dinosaurs on an island.
  
  Let the grannies and idiots stick themselves in the reservations and asylums run by billionaires.
  
  Let’s all make Neocities pages about our hobbies and dirtiest, innermost thoughts. With gifs all over.
  
  source
  - Furbag@lemmy.world ⁨3⁩ ⁨months⁩ ago
    I’m down with that. Web 1.5? Let’s do it. I’ll get my Geocities page up and then we can rev up that hit counter.
    
    source
    -> View More Comments
- kameecoding@lemmy.world ⁨3⁩ ⁨months⁩ ago
  
  Web3 was about enabling us to securely transfer value between people digitally and without middlemen
  
  I don’t think it ever was that, I think folding ideas has the best explanation of what it was meant to be, it was meant to be a way to grab power, away from those who already have it
  
  youtu.be/YQ_xWvX1n9g
  
  source
- vacuumflower@lemmy.sdf.org ⁨3⁩ ⁨months⁩ ago
  Much drama.
  
  I agree about semantic web, but the issue is with all of the Internet. Both its monopoly as the medium of communication, and its architecture.
  
  And if we go semantic for webpages, allowing the clients to construct representation, then we can go further, to separate data from medium, making messages and identities exist in a global space, as they (sort of, need a better solution) do in Usenet.
  
  About the Internet itself being the problem - that’s because it’s hierarchical, despite appearances, and nobody understands it well. Especially since new systems of this kind are not being built often, to say the least, so the majority of people using the Internet doesn’t even think about it as a system. It takes it for given that this is the only paradigm for the global network. And that it’s application-neutral, which may not be true.
  
  20 years ago, when I was a kid, people would think and imagine all kinds of things about the Internet and about the future and about ways all this can break, and these were normal people, not tech types, and one would think with time we wouldn’t become more certain, as it becomes bigger and bigger.
  
  OK, I’m just having an overvalued idea that the Internet is poisoned. Bad sleep, nasty weather, too much sweets eaten. Maybe that movement of packets on the IP protocol can somehow give someone free computation, with enough machines under their control, by using counters in the network stack as registers, or maybe something else.
  
  source
- NeilBru@lemmy.world ⁨3⁩ ⁨months⁩ ago
  The Simulacrum
  
  source
- hansolo@lemmy.today ⁨3⁩ ⁨months⁩ ago
  Preach!
  
  source
PhilipTheBucket@piefed.social ⁨3⁩ ⁨months⁩ ago
I feel like at some point it needs to be active response. Phase 1 is a teergrube type of slowness to muck up the crawlers, with warnings in the headers and response body, and then phase 2 is a DDOS in response or maybe just a drone strike and cut out the middleman. Once you've actively evading Anubis, fuckin' game on.

source
- turbowafflz@lemmy.world ⁨3⁩ ⁨months⁩ ago
  I think the best thing to do is to not block them when they’re detected but poison them instead. Feed them tons of text generated by tiny old language models, it’s harder to detect and also messes up their training and makes the models less reliable. Of course you would want to do that on a separate server so it doesn’t slow down real users, but you probably don’t need much power since the scrapers probably don’t really care about the speed
  
  source
  - xthexder@l.sw0.com ⁨3⁩ ⁨months⁩ ago
    I love catching bots in tarpits, it’s actually quite fun
    
    source
  - 31ank@ani.social ⁨3⁩ ⁨months⁩ ago
    Some guy also used zip bombs against AI crawlers, don’t know if it still works. Link to the lemmy post
    
    source
  - phx@lemmy.ca ⁨3⁩ ⁨months⁩ ago
    Yeah that was my thought. Don’t reject them, that’s obvious and they’ll work around it. Feed them shit data - but not too obviously shit - and they’ll not only swallow it but eventually build up to levels where it compromises them.
    
    I’ve suggested the same for plain old non-AI data stealing. Make the data useless to them and cost more work to separate good from bad, and they’ll eventually either sod off or die.
    
    A low power AI actually seems like a good way to generate a ton of believable - but bad - data that can be used to fight the bad AI’s. It doesn’t need to be done real-time either as datasets can be generated in advance
    
    source
    -> View More Comments
  - sudo@programming.dev ⁨3⁩ ⁨months⁩ ago
    The problem is primarily the resource drain on the server and tarpitting tactics usually increase thag resource burden by maintaining the open connections.
    
    source
    -> View More Comments
- TIN@feddit.uk ⁨3⁩ ⁨months⁩ ago
  Wasn’t this called black ice in Neuromancer? Security systems that actively tried to harm the hacker?
  
  source
- traches@sh.itjust.works ⁨3⁩ ⁨months⁩ ago
  These crawlers come from random people’s devices via shady apps. Each request comes from a different IP
  
  source
  - AmbitiousProcess@piefed.social ⁨3⁩ ⁨months⁩ ago
    Most of these AI crawlers are from major corporations operating out of datacenters with known IP ranges, which is why they do IP range blocks. That's why in Codeberg's response, they mention that after they fixed the configuration issue that only blocked those IP ranges on non-Anubis routes, the crawling stopped.
    
    For example, OpenAI publishes a list of IP ranges that their crawlers can come from, and also displays user agents for each bot.
    
    Perplexity also publishes IP ranges, but Cloudflare later found them bypassing no-crawl directives with undeclared crawlers. They did use different IPs, but not from "shady apps." Instead, they would simply rotate ASNs, and request a new IP.
    
    The reason they do this is because it is still legal for them to do so. Rotating ASNs and IPs within that ASN is not a crime. However, maliciously utilizing apps installed on people's devices to route network traffic they're unaware of is. It also carries much higher latency, and could even allow for man-in-the-middle attacks, which they clearly don't want.
    
    source
    -> View More Comments
  - sudo@programming.dev ⁨3⁩ ⁨months⁩ ago
    Or your TV or IOT devices. Residential proxies are extremely shady businesses.
    
    source
  - amelaxxx@piefed.social ⁨3⁩ ⁨months⁩ ago
    Yep
    
    source
  - PhilipTheBucket@piefed.social ⁨3⁩ ⁨months⁩ ago
    Is that really true? I guess I have no reason to doubt it, I just hadn't heard it before.
    
    source
    -> View More Comments
- NuXCOM_90Percent@lemmy.zip ⁨3⁩ ⁨months⁩ ago
  Yes. A nonprofit organization in Germany is going to be launching drone strikes globally. That is totally a better world.
  
  Its also important to understand that a significant chunk of these botnets are just normal people with viruses/compromised machines. And the fastest way to launch a DDOS attack is to… rent the same botnet from the same blackhat org to attack itself. And while that would be funny, I would also rather orgs I donate to not giving that money to blackhat orgs. But that is just me.
  
  source
  - bleistift2@sopuli.xyz ⁨3⁩ ⁨months⁩ ago
    en.wikipedia.org/wiki/Sarcasm
    
    source
- amelaxxx@piefed.social ⁨3⁩ ⁨months⁩ ago
  Right
  
  source
zifk@sh.itjust.works ⁨3⁩ ⁨months⁩ ago
Anubis isn’t supposed to be hard to avoid, but expensive to avoid. Not really surprised that a big company might be willing to throw a bunch of cash at it.

source
- sudo@programming.dev ⁨3⁩ ⁨months⁩ ago
  This is what I’ve kept saying about POW being a shit bot management tactic. Its a flat tax across all users, real or fake. The fake users are getting making money to access your site and will just eat the added expense. You can raise the tax to cost more than what your data is worth to them, but that also affects your real users. Nothing about Anubis even attempts to differentiate between bots and real users.
  
  If the bots take the time, they can set up a pipeline to solve Anubis tokens outside of the browser more efficiently than real users.
  
  source
  - black_flag@lemmy.dbzer0.com ⁨3⁩ ⁨months⁩ ago
    Yeah but ai companies are losing money so in the long run Anubis seems like it should eventually return to working.
    
    source
    -> View More Comments
  - OpenPassageways@lemmy.zip ⁨3⁩ ⁨months⁩ ago
    What the alternative?
    
    source
    -> View More Comments
- randomblock1@lemmy.world ⁨3⁩ ⁨months⁩ ago
  No, it’s expensive to comply (at a massive scale), but easy to avoid. Just change the user agent. There’s even a dedicated extension for bypassing Anubis.
  
  Even then AI servers have plenty of compute, it realistically doesn’t cost much. Maybe like a thousandth of a cent per solve? They’re spending billions on GPU power, they don’t care.
  
  I’ve been saying this since day 1 of Anubis but nobody wants to hear it.
  
  source
  - T156@lemmy.world ⁨3⁩ ⁨months⁩ ago
    The website would also have to display to users at the end of the day. It’s a similar problem as trying to solve media piracy. Worst comes to it, the crawlers could read the page like a person would.
    
    source
    -> View More Comments
prole@lemmy.blahaj.zone ⁨3⁩ ⁨months⁩ ago
Tech bros just actively making the internet worse for everyone.

source
- ShaggySnacks@lemmy.myserv.one ⁨3⁩ ⁨months⁩ ago
  
  Tech bros just actively making ~~the internet~~ society worse for everyone.
  
  FTFY.
  
  source
- iopq@lemmy.world ⁨3⁩ ⁨months⁩ ago
  I mean, tech bros of the past invented the internet
  
  source
  - notarobot@lemmy.zip ⁨3⁩ ⁨months⁩ ago
    Those are not the tech bros. The tech bros are the ones who move fast and break things. The internet was built by engineers and developers
    
    source
  - prole@lemmy.blahaj.zone ⁨3⁩ ⁨months⁩ ago
    Nah, that was DAARPA
    
    source
  - CeeBee_Eh@lemmy.world ⁨3⁩ ⁨months⁩ ago
    Those were tech nerds. “Tech bros” are jabronis who see the tech sector as a way to increase the value of the money their daddies gave them.
    
    source
londos@lemmy.world ⁨3⁩ ⁨months⁩ ago
Can there be a challenge that actually does some maliciously useful compute? Like make their crawlers mine bitcoin or something.

source
- raspberriesareyummy@lemmy.world ⁨3⁩ ⁨months⁩ ago
  Did you just say use the words “useful” and “bitcoin” in the same sentence? o_O
  
  source
  - polle@feddit.org ⁨3⁩ ⁨months⁩ ago
    The saddest part is, we thought crypto was the biggest waste of energy ever and then the LLMs entered the chat.
    
    source
    -> View More Comments
  - kameecoding@lemmy.world ⁨3⁩ ⁨months⁩ ago
    Bro couldn’t even bring himself to mention protein folding because that’s too socialist I guess.
    
    source
    -> View More Comments
  - londos@lemmy.world ⁨3⁩ ⁨months⁩ ago
    I went back and added “malicious” because I knew it wasn’t useful in reality. I just wanted to express the AI crawlers doing free work. But you’re right, bitcoin sucks.
    
    source
    -> View More Comments
- T156@lemmy.world ⁨3⁩ ⁨months⁩ ago
  Not without making real users also mine bitcoin/avoiding the site because their performance tanked.
  
  source
- nymnympseudonym@lemmy.world ⁨3⁩ ⁨months⁩ ago
  The Monero community spent a long time trying to find a “useful PoW” function. The problem is that most computations that are useful are not also easy to verify as correct. javascript optimization was one direction that got pursued pretty far.
  
  But at the end of the day, a crypto that actually intends to withstand attacks from major governments requires a system that is decentralized, trustless, and verifiable, and the only solutions that have been found to date involve algorithms for which a GPU or even custom ASIC confers no significant advantage over a consumer-grade CPU.
  
  source
- 0x0@lemmy.zip ⁨3⁩ ⁨months⁩ ago
  Anubis does that. You may’ve seen it already.
  
  source
UnderpantsWeevil@lemmy.world ⁨3⁩ ⁨months⁩ ago
I mean, we really have to ask ourselves - as a civilization - whether human collaboration is more important than AI data harvesting.

source
- devfuuu@lemmy.world [bot] ⁨3⁩ ⁨months⁩ ago
  I think every company in the world is telling e everyone for a few months now that ehat matter is AI data harvesting. There’s not even a hint of it being a question. You either accept the AI overlords or get out of the internet. Our ONLY purpose it to feed the machine, anything else is irrelevant. Play along or you shall be removed.
  
  source
  - ScoffingLizard@lemmy.dbzer0.com ⁨3⁩ ⁨months⁩ ago
    We need to poison better.
    
    source
  - gian@lemmy.grys.it ⁨3⁩ ⁨months⁩ ago
    
    get out of the internet.
    
    At some point, this would be the best option, sadly
    
    source
- willington@lemmy.dbzer0.com ⁨3⁩ ⁨months⁩ ago
  I was fine before the AI.
  
  The biggest customer of AI are the billionaires who can’t hire enough people for their technofeudalist/surveillance capitalism agenda. The billionaires (wannabe aristocrats) know that machines have no morals, no bottom lines, no scrupples, don’t leak info to the press, don’t complain, don’t demand to take time off or to work from home, etc.
  
  AI makes the perfect fascist.
  
  They sell AI like it’s a benefit to us all, but it ain’t that. It’s a benefit to the billionaires who think they own our world.
  
  Fuck AI.
  
  source
thatonecoder@lemmy.ca ⁨3⁩ ⁨months⁩ ago
I know this is the most ridiculous idea, but we need to pack our bags and make a new internet protocol, to separate us from the rest, at least for a while. Either way, most “modern” internet things (looking at you, JavaScript) are not modern at all, and starting over might help more than any of us could imagine.

source
- Pro@programming.dev ⁨3⁩ ⁨months⁩ ago
  Like Gemini?
  
  Gemini is a new internet technology supporting an electronic library of interconnected text documents. That’s not a new idea, but it’s not old fashioned either. It’s timeless, and deserves tools which treat it as a first class concept, not a vestigial corner case. Gemini isn’t about innovation or disruption, it’s about providing some respite for those who feel the internet has been disrupted enough already. We’re not out to change the world or destroy other technologies. We are out to build a lightweight online space where documents are just documents, in the interests of every reader’s privacy, attention and bandwidth.
  
  source
  - thatonecoder@lemmy.ca ⁨3⁩ ⁨months⁩ ago
    Yep! That was exactly the protocol on my mind. One thing, though, is that the Fediverse would need to be ported to Gemini, or at least for a new protocol to be created for Gemini.
    
    source
    -> View More Comments
  - cwista@lemmy.world ⁨3⁩ ⁨months⁩ ago
    Won’t the bots just adapt and move there too?
    
    source
  - 0x0@lemmy.zip ⁨3⁩ ⁨months⁩ ago
    It’s not the most well thought-out, from a technical perspective, but it’s pretty damn cool. Gemini pods are a freakin’ rabbi hole.
    
    source
  - vacuumflower@lemmy.sdf.org ⁨3⁩ ⁨months⁩ ago
    I’ve personally played with Gemini a few months ago, and now want a new Internet as opposed to a new Web.
    
    Replace IP protocols with something better. With some kind of relative addressing, and delay-tolerant synchronization being preferred to real-time connections between two computers. So that there were no permanent global addresses at all, and no centralized DNS.
    
    With the main “Web” over that being just replicated posts with tags hyperlinked by IDs, with IDs determined by content. Structured, like semantic web, so that a program could easily use such a post as directory of other posts or a source of text or retrieve binary content.
    
    With user identities being a kind of post content, and post authorship being too a kind of post content or maybe tag content, cryptographically signed.
    
    Except that would require to resolve post dependencies and retrieve them too with some depth limit, not just the post one currently opens, because, if it’d be like with bittorrent, half the hyperlinks in found posts would soon become dead, and also user identities would possibly soon become dead, making authorship check impossible.
    
    And posts (suppose even sites of that flatweb) being found by tags, maybe by author tag, maybe by some “channel” tag, maybe by “name” tag, one can imagine plenty of things.
    
    The main thing is to replace “clients connecting to a service” with “persons operating on messages replicated on the network”, with networked computers sharing data like echo or ripples on the water. In what would be the general application layer for such a system.
    
    OK, this is very complex to do and probably stupid.
    
    It’s also not exactly the same level as IP protocols, so this can work over the Internet, just like the Internet worked just fine, for some people, over packet radio and UUCP or FTN email gates and copper landlines. Just for the Internet to be the main layer in terms of which we find services, on the IP protocols, TCP, UDP, ICMP, all that, and various ones and DNS on application layer, - that I consider wrong, it’s too hierarchical. So it’s not a “replacement”.
    
    source
    -> View More Comments
oeuf@slrpnk.net ⁨3⁩ ⁨months⁩ ago
Crazy. DDoS attacks are illegal here in the UK.

source
- rdri@lemmy.world ⁨3⁩ ⁨months⁩ ago
  So, sue the attackers?
  
  source
- BlameTheAntifa@lemmy.world ⁨3⁩ ⁨months⁩ ago
  The problem is that hundreds of bad actors doing the same thing independently of one another means it does not qualify as a DDoS attack. Maybe it’s time we start legally restricting bots and crawlers, though.
  
  source
wetbeardhairs@lemmy.dbzer0.com ⁨3⁩ ⁨months⁩ ago
Gosh. Corporations are rampantly attempting to access resources so they can perform copyright infringement en-masse. I wonder if there is a legal mechanism to stop them? Oh, no there isn’t because our government is fully corrupted.

source
- aquovie@lemmy.cafe ⁨3⁩ ⁨months⁩ ago
  I think, in this particular case, it’s aggressive apathy/incompetence and not malice. Remember, Trump didn’t even know what Nvidia was.
  
  AI’s don’t have a skin color or use the bathroom so you can’t whip your cult into a frenzy by Othering it. You can’t solidify your fascism by getting bogged down in the details of IP law.
  
  source
  - Corkyskog@sh.itjust.works ⁨3⁩ ⁨months⁩ ago
    Just say that the AI will be used to train the immigrants to take der jerbs.
    
    source
    -> View More Comments
  - 0x0@lemmy.zip ⁨3⁩ ⁨months⁩ ago
    
    Trump didn’t even know what Nvidia was.
    
    I think you mean navidia.
    
    source
nialv7@lemmy.world ⁨3⁩ ⁨months⁩ ago
We had a trust based system for so long. No one is forced to honor robots.txt, but most big players did. Almost restores my faith in humanity a little bit. And then AI companies came and destroyed everything. This is we can’t have nice things.

source
- Shapillon@lemmy.world ⁨3⁩ ⁨months⁩ ago
  Big players are the ones behind most AIs though.
  
  source
zbyte64@awful.systems ⁨3⁩ ⁨months⁩ ago
Is there nightshade but for text and code? Maybe my source headers should include a bunch of special characters that then give a prompt injection. And sprinkle some nonsensical code comments before the real code comment.

source
- kuberoot@discuss.tchncs.de ⁨3⁩ ⁨months⁩ ago
  I think the issue is that text uses comparatively very little information, so you can’t just inject invisible changes by changing the least insignificant bits - you’d need to change the actual phrasing/spelling of your text/code, and that’d be noticable.
  
  source
- qaz@lemmy.world ⁨3⁩ ⁨months⁩ ago
  There are glitch tokens but I think those only effect it when using it.
  
  source
- Honytawk@feddit.nl ⁨3⁩ ⁨months⁩ ago
  Maybe like a bunch of white text at 2pt?
  
  Not visible to the user, but fully readable by crawlers.
  
  source
cupcakezealot@piefed.blahaj.zone ⁨3⁩ ⁨months⁩ ago
reminder to donate to codeberg and forgejo :)

source
0x0@lemmy.zip ⁨3⁩ ⁨months⁩ ago
It’s always a cat-n-mouse game.

source
Kyrgizion@lemmy.world ⁨3⁩ ⁨months⁩ ago
Eventually we’ll have “defensive” and “offensive” llm’s managing all kinds of electronic warfare automatically, effectively nullifying each other.

source
StopSpazzing@lemmy.world ⁨3⁩ ⁨months⁩ ago
Is there a migration tool? If not would be awesome to migrate everything includong issues and stuff. Bet even more people woild move.

source
mfed1122@discuss.tchncs.de ⁨3⁩ ⁨months⁩ ago
Okay what about…what about uhhh… Static site builders that render the whole page out as an image map, making it visible for humans but useless forccrawlers 🤔🤔🤔

source
sailorzoop@lemmy.librebun.com ⁨3⁩ ⁨months⁩ ago
I’m ashamed to say that I switched my DNS nameservers to CF just for their anti crawler service.
Knowing Cloudflare, god know how much longer it’ll be free for.

source
bizza@lemmy.zip ⁨3⁩ ⁨months⁩ ago
I use Anubis on my personal website, not because I think anything I’ve written is important enough that companies would want to scrape it, but as a “fuck you” to those companies regardless

That the bots are learning to get around it is disheartening, Anubis was a pain to setup and get running

source
Wispy2891@lemmy.world ⁨3⁩ ⁨months⁩ ago
Question: those artificial stupidity bots want to steal the issues or want to steal the code? Because why they’re wasting a lot of resources scraping millions of pages when they can steal everything via SSH (once a month, not 120 times a second)

source
Goretantath@lemmy.world ⁨3⁩ ⁨months⁩ ago
I knew that was the worse option. Use the one that traps them in an infinite maze.

source
Monument@lemmy.sdf.org ⁨3⁩ ⁨months⁩ ago
Increasingly, I’m reminded of this: Paul Bunyan vs. the spam bot (or how Paul Bunyan triggered the singularity to win a bet). It’s a medium-length read from the old internet, but fun.

source
interdimensionalmeme@lemmy.ml ⁨3⁩ ⁨months⁩ ago
Just provide a full dump.zip plus incremental daily dumps and they won’t have to scrape ? Isn’t that an obvious solution ? I mean, it’s public data, it’s out there, do you want it public or not ? Do you want it only on openai and google but nowhere else ? If so then good luck with the piranhas

source
xxce2AAb@feddit.dk ⁨3⁩ ⁨months⁩ ago
If this isn’t fertile grounds for a massive class-action lawsuit, I don’t know what would be.

source
carrylex@lemmy.world ⁨3⁩ ⁨months⁩ ago
And once again a WebApplicationFirewall(WAF) was defeated and it turns out that blocklists and bot detection tools like fail2ban are the way to go…

source
e8CArkcAuLE@piefed.social ⁨3⁩ ⁨months⁩ ago
Image

how this felt like while reading

source