FOSS infrastructure is under attack by AI companies
Submitted 2 weeks ago by simple@lemm.ee to technology@lemmy.world
https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/
Comments
fjordo@feddit.uk 2 weeks ago
I wish these companies would realise that acting like this is a very fast way to get scraping outlawed altogether, which is a shame because it can be genuinely useful (archival, automation, etc.).
jol@discuss.tchncs.de 2 weeks ago
How can you outlaw something a company on another continent is doing? Especially when they're getting better at disguising themselves as normal traffic? What will happen is that politicians will see this as another reason to push for everyone having their ID associated with their Internet traffic.
MoogleMaestro@lemmy.zip 2 weeks ago
What will happen is that politicians will see this as another reason to push for everyone having their ID associated with their Internet traffic.
You’re right. Which is exactly why companies should exhibit better behaviour and self-regulate before they make the internet infinitely worse for everyone.
Buelldozer@lemmy.today 2 weeks ago
What will happen is that politicians will see this as another reason to push for everyone having their ID associated with their Internet traffic.
Yes, because like it or not, that’s the only possible solution. If all traffic were required to be signed and the signatures tied to an entity, then you could refuse unsigned traffic, and if signed traffic was causing problems you’d know who it was and have recourse.
I don’t like this solution but it’s the only way forward that I can see.
Buelldozer@lemmy.today 2 weeks ago
I too read Drew DeVault’s article the other day and I’m still wondering how the hell these companies have access to “tens of thousands” of unique IP addresses. Seriously, how the hell do they have access to so many IP addresses that SysAdmins are resorting to banning entire countries to make it stop?
festus@lemmy.ca 2 weeks ago
There are residential IP providers that sell services to scrapers etc., giving them thousands of IPs drawn from the same ranges as real users. They route traffic through these IPs via malware, hacked routers, “free” VPN clients, and so on. If you block the range for one of these addresses, you’ll also block real users.
Buelldozer@lemmy.today 1 week ago
There are residential IP providers that provide services to scrapers, etc. that involves them having thousands of IPs available from the same IP ranges as real users.
Now that makes sense. I hadn’t considered rogue ISPs.
werefreeatlast@lemmy.world 2 weeks ago
If you get something like 156.67.234.6, then .7, then .56, etc., just block 156.67.234.0/24
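For anyone unsure what a CIDR block actually covers, Python's stdlib ipaddress module makes it easy to check. The addresses below are the illustrative ones from the comment above:

```python
import ipaddress

# The /24 covers only the last octet: 156.67.234.0 through 156.67.234.255.
net = ipaddress.ip_network("156.67.234.0/24")
for ip in ("156.67.234.6", "156.67.234.7", "156.67.234.56"):
    assert ipaddress.ip_address(ip) in net

# To block all of 156.67.*.* you would need a /16 instead.
assert ipaddress.ip_address("156.67.1.1") not in net
assert ipaddress.ip_address("156.67.1.1") in ipaddress.ip_network("156.67.0.0/16")
```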
Buelldozer@lemmy.today 1 week ago
Sure, network blocking like this has been a thing for decades, but it still requires ongoing manual intervention, which is what these SysAdmins are complaining about.
GreenKnight23@lemmy.world 2 weeks ago
[deleted]
Buelldozer@lemmy.today 1 week ago
fail2ban
I’m familiar with f2b. I even have several clients licensed with the commercial version but it doesn’t fit this use case as there’s no logon failure for it to work with.
I automatically ban any IP that comes from outside the US because there’s literally no reason for anyone outside the US to make requests to my infra.
I have systems setup with geo-blocking but it’s of limited use due to the prevalence of VPNs.
also, use a WAF on a NAT to expose your apps.
This isn’t a solution either because a WAF has no way to know what traffic is bad so it doesn’t know what to block.
klu9@lemmy.ca 2 weeks ago
The Linux Mint forums have been knocked offline multiple times over the last few months, to the point where the admins had to block all Chinese and Brazilian IPs for a while.
deeferg@lemmy.world 2 weeks ago
This is the first I’ve heard about Brazil in this type of cyber attack. Is it re-routed traffic going there or are there a large number of Brazilian bot farms now?
klu9@lemmy.ca 2 weeks ago
I don’t know why/how, just know that the admins saw the servers were being overwhelmed by traffic from Brazilian IPs and blocked it for a while.
melpomenesclevage@lemmy.dbzer0.com 2 weeks ago
I hear there’s a tool called ‘Nepenthes’ that creates a loop for an LLM. If you use that in combination with a fairly tight blacklist of IPs you’re certain are LLM crawlers, I bet you could do a lot of damage, and maybe make them slow their shit down, or do this in a more reasonable way.
PrivacyDingus@lemmy.world 2 weeks ago
nepenthes
It’s a Markov-chain-based text generator, which could be difficult for people to implement on repos depending on how they’re hosting them. Regardless, any sensibly built crawler will have rate limits. This means that although Nepenthes is an interesting thought exercise, it’s only going to do anything to things knocked together by people who haven’t thought about it, not the big companies with real resources who are likely having the biggest impact.
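For reference, the core idea of a Markov-chain babbler can be sketched in a few lines of Python. This is a toy illustration of the technique, not Nepenthes’ actual code:

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it in the corpus."""
    words = text.split()
    chain = defaultdict(list)
    for cur, nxt in zip(words, words[1:]):
        chain[cur].append(nxt)
    return chain

def babble(chain, start, length=20, seed=0):
    """Walk the chain from `start`, emitting plausible-looking nonsense."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        nexts = chain.get(out[-1])
        if not nexts:
            break
        out.append(rng.choice(nexts))
    return " ".join(out)
```

A tarpit serves endless pages of this output, so a crawler that follows links burns time and bandwidth ingesting garbage.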
melpomenesclevage@lemmy.dbzer0.com 2 weeks ago
It might hit a few times. Or maybe there’s a version that can puff up the data in the sense of space, and salt it in the sense of utility.
Fijxu@programming.dev 2 weeks ago
AI scraping is so cancerous. I host a public RedLib instance (redlib.nadeko.net), and due to BingBot and Amazon bots my instance was always rate limited because the number of requests they make is insane. What makes me angrier is that these fucking fuckers use free, privacy-respecting services to access Reddit and scrape it. THEY CAN’T BE SO GREEDY. Hopefully, blocking their user-agent works fine ;)
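Blocking by user-agent can be as simple as a substring check at the edge. A minimal sketch in Python; the bot tokens below are illustrative, and note that only well-behaved crawlers send honest user-agents, so this is a first line of defence rather than a fix:

```python
# Hypothetical list of user-agent substrings to reject; real lists
# need ongoing curation, and scrapers can trivially spoof the header.
BLOCKED_AGENTS = ("bingbot", "amazonbot", "gptbot", "ccbot")

def should_block(user_agent: str) -> bool:
    """Return True if the user-agent matches a known crawler token."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_AGENTS)
```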
green@feddit.nl 1 week ago
Thanks for hosting your instances. I use them often and they’re really well maintained.
grue@lemmy.world 2 weeks ago
ELI5 why the AI companies can’t just clone the git repos and do all the slicing and dicing (running
git blame
etc.) locally instead of running expensive queries on the projects’ servers?
Realitaetsverlust@lemmy.zip 2 weeks ago
Because that would cost you money, so just “abusing” someone else’s infrastructure is much cheaper.
zovits@lemmy.world 2 weeks ago
Takes more effort and results in a static snapshot, without being able to track the evolution of the project. (Disclaimer: I don’t work with AI, but I’d bet this is the reason. Also, I don’t intend to defend those scraping twatwaffles in any way, just to offer a possible explanation.)
Sturgist@lemmy.ca 2 weeks ago
Also, having your victim bear the costs is an added benefit.
green@feddit.nl 1 week ago
Too many people overestimate the actual capabilities of these companies.
I really do not like saying this because it lacks a lot of nuance, but 90% of programmers are not skilled in their profession. This is not to say they are stupid (though they likely are; see cat-v/harmful), but they do not care about efficiency or gracefulness, as long as the job gets done.
You assume they are using source control (which is, unironically, unlikely), you assume they know they can run a server locally (which I pray they do), and you assume their deadlines allow them to think about actual solutions to problems (which they probably don’t).
Yes, they get paid a lot of money. But that does not say much about skill in an age of apathy and lawlessness.
turmacar@lemmy.world 1 week ago
Also, everyone’s solution to a problem is stupid if they’re only given 5 minutes to work on it.
Combine that with it being “free” for them to query the website and expensive to have enough local storage to replicate, even temporarily, all the stuff they want to scrape, and it’s kind of a no-brainer to ‘just not do that’. The only thing stopping them is morals, or whether they want to keep paying rent.
grysbok@lemmy.sdf.org 2 weeks ago
It’s also a huge problem for library/archive/museum websites. We try so hard to make data available to everyone, then some rude bots come along and bring the site down. Adding more resources just uses more resources; the bots expand to fill the container.
RobotToaster@mander.xyz 2 weeks ago
If an AI is detecting bugs, the least it could do is file a pull request, these things are supposed to be master coders right? 🙃
reksas@sopuli.xyz 2 weeks ago
To me, AI is a bit like a bucket of water, if you replace the water with “information”. It’s a tool and it can’t do anything on its own. You could write a program and instruct it to do something, but it would work just as chaotically as when you generate stuff with AI. It annoys me so much to see so many (people in general) think that what they call AI is in any way capable of independent action. It just does what you tell it to do, based on how it has been trained, which is also why relying on AI trained by someone you shouldn’t trust is a bad idea.
MonkderVierte@lemmy.ml 2 weeks ago
Assuming we could build a new internet from the ground up, what would be the solution? IPFS?
Buelldozer@lemmy.today 2 weeks ago
what would be the solution?
Simple: not allowing anonymous activity. If everything were required to be cryptographically signed in such a way that it was tied to a known entity, then this could be directly addressed. It’s essentially the same problem that e-mail has with spam, and disallowing anonymous traffic would mostly solve that problem as well.
Of course many internet users would (rightfully) fight that solution tooth and nail.
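As a toy illustration of the verify-or-reject flow being described: the sketch below uses stdlib HMAC as a stand-in, which only shows the mechanics; a real identity-binding scheme would need public-key signatures (e.g. Ed25519) with keys registered to known entities, not a shared secret:

```python
import hashlib
import hmac

def sign(key: bytes, payload: bytes) -> str:
    """Produce a tag over the payload; a real scheme would use an
    asymmetric signature tied to a registered identity."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def accept(key: bytes, payload: bytes, signature: str) -> bool:
    """Refuse any traffic whose signature does not verify."""
    expected = sign(key, payload)
    return hmac.compare_digest(expected, signature)
```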
shortwavesurfer@lemmy.zip 2 weeks ago
Proof of work before connections are established. The Tor network implemented this in August of 2023 and it has helped a ton.
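For context, the general shape of a proof-of-work gate can be sketched in Python. This is a toy hashcash-style version, not Tor’s actual scheme (which uses a different, tunable puzzle); the point is that the client pays CPU time up front while verification stays cheap:

```python
import hashlib
import itertools

def solve(challenge: bytes, difficulty_bits: int) -> int:
    """Brute-force a nonce whose SHA-256 with the challenge has
    `difficulty_bits` leading zero bits. Cost grows as 2**difficulty_bits."""
    target = 1 << (256 - difficulty_bits)
    for nonce in itertools.count():
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

def verify(challenge: bytes, nonce: int, difficulty_bits: int) -> bool:
    """One hash to check: cheap for the server, expensive to forge."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))
```

Raising the difficulty under load makes mass scraping costly while barely affecting individual users.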
MonkderVierte@lemmy.ml 2 weeks ago
No, that’s not a solution, since it would make privacy impossible, and bad actors would still find ways around it.
melpomenesclevage@lemmy.dbzer0.com 2 weeks ago
Take the resources from them so they don’t have them anymore. Possibly something involving a coal mine for CEOs.
cy_narrator@discuss.tchncs.de 2 weeks ago
AI will come up there to abuse it as well
pineapplelover@lemm.ee 2 weeks ago
They’re afraid
daq@lemmy.sdf.org 1 week ago
I’m not sure how they actually implemented it, but you can easily block ML crawlers via Cloudflare. Isn’t just about every small site/service behind CF anyway?
grysbok@lemmy.sdf.org 1 week ago
Last I checked, cloudflare requires the user to have JavaScript and cookies enabled. My institution doesn’t want to require those because it would likely impact legitimate users as well as bots.
daq@lemmy.sdf.org 1 week ago
Huh? I can reach my site via curl that has neither. How did you come up with this random set of requirements?
01189998819991197253@infosec.pub 1 week ago
Fail2ban should add all those scraper IPs, and we need to just flat-out block them. Or send them to those mazes. Or redirect them to themselves lol
db0@lemmy.dbzer0.com 2 weeks ago
Yep, it hit many Lemmy servers as well, including mine. I had to block multiple Alibaba subnets to get things back to normal. But I’m expecting the next spam wave.