A Project to Poison LLM Crawlers

Submitted ⁨⁨1⁩ ⁨month⁩ ago⁩ by ⁨Disillusionist@piefed.world⁩ to ⁨technology@lemmy.world⁩

Website operators are being asked to feed LLM crawlers poisoned data by a project called Poison Fountain.

The project page links to URLs which provide a practically endless stream of poisoned training data. They have determined that this approach is very effective at ultimately sabotaging the quality and accuracy of AI which has been trained on it.

Small quantities of poisoned training data can significantly damage a language model.

The page also gives suggestions on how to put the provided resources to use.

source

Comments

Sort:hotnew top

termaxima@slrpnk.net ⁨1⁩ ⁨month⁩ ago
Been thinking about making one of these too, especially since I have a catchy name : asbestos

source
- ZILtoid1991@lemmy.world ⁨1⁩ ⁨month⁩ ago
  Me too, but with procedural image generation. Use some templates which are put together with CPU blitter (extremely fast and effective), add some random descriptive text, then done. Don’t know how much my theory would work IRL.
  
  source
vacuumflower@lemmy.sdf.org ⁨1⁩ ⁨month⁩ ago
If, suppose, I were optimistic over this technology, but pessimistic over its current stage of development, I’d expect this to be a cure. It’s a problem they’ll have to solve. A test they’ll have to pass.

If somewhere inside those things someone makes a mechanism building a graph of syllogisms, no kind of poisoned input data will be able to hurt them.

So - this is a good thing, but when people say it’s a rebellion, it’s not.

source
- FlashMobOfOne@lemmy.world ⁨1⁩ ⁨month⁩ ago
  
  A test they’ll have to pass.
  
  This makes me chuckle, as they invented euphemisms like ‘hallucinations’ because their LLM models can’t do what they promise. Fabulous marketing, but clearly they didn’t do enough testing.
  
  source
  - Bazoogle@lemmy.world ⁨1⁩ ⁨month⁩ ago
    
    as they invented euphemisms like ‘hallucinations’
    
    Seems like a pretty accurate word to use, no? Could also use fabrication, concoction, phantom, or something else? I think “lie” and its synonyms are not accurate, since that requires intent. Since the LLM does not have intent, it cannot “lie”.
    
    source
    -> View More Comments
  - vacuumflower@lemmy.sdf.org ⁨1⁩ ⁨month⁩ ago
    I said, in other words, that it doesn’t matter what they do until this problem is solved. So if this is described as some sort of rebellion against AI (or “AI”), then no. At the point where it becomes dangerous technology in itself and not just for economy, it won’t be.
    
    source
- Disillusionist@piefed.world ⁨1⁩ ⁨month⁩ ago
  Not all problems may be cured immediately. Battles are rarely won with a single attack. A good thing is not the same as nothing.
  
  source
- treesapx@lemmy.world ⁨1⁩ ⁨month⁩ ago
  “You’re not opposing me. All you’ve done is create a problem that will stop me until I have it figured out.” is the description of every struggle between opposing forces, so it’s interesting that you disagree with that.
  
  source
  - vacuumflower@lemmy.sdf.org ⁨1⁩ ⁨month⁩ ago
    Not really, more like “if I can find a key to the door, I can open it, so engraving a fixed combination for the door lock on the same key doesn’t change much”.
    
    Poisoned data is fundamentally valid data. Concepts of logical connectivity and statements being true or false are something needed to use it.
    
    source
- a_non_monotonic_function@lemmy.world ⁨1⁩ ⁨month⁩ ago
  You ascribe far too much to the internal workings than is reasonable.
  
  source
vane@lemmy.world ⁨1⁩ ⁨month⁩ ago
I have around 10-20GB github / gitlab mirror. I am constantly under attack from crawlers from top US technology corporations and LLM startups. Whenever I ban one IP range they switch to other - I don’t know if those fuckers have tickets in their systems to do it manually or they just deploy this shit all over the planet. From what I observe during attacks that I mitigate the best way to poison them is to just create gitea instance with poisoned code repository and couple hundred revisions. It’s because what they are most interested in is html representation of diff between two git revisions.

source
- E_coli42@lemmy.world ⁨1⁩ ⁨month⁩ ago
  Why isn’t there anything in the DMCA for stopping crawlers? They have stuff about requiring crawlers to follow attribution and whatnot, but nothing for not allowing crawlers in the first place. Stupid as shit.
  
  source
- douglasg14b@lemmy.world ⁨1⁩ ⁨month⁩ ago
  I can get a 50Gb/s residential link where I am, and have a whole rack of servers.
  
  Sounds like a good opportunity to crowd fund thousands and thousands of common scrap able instances that have random poisoning.
  
  source
  - vane@lemmy.world ⁨1⁩ ⁨month⁩ ago
    To be honest bandwidth isn’t a problem because it’s text files. The problem is to optimize network stack for multiple connections because they’re hitting from whole subnets without any delay so literally ddos and cache those html files because at some point CPU becomes bottleneck.
    
    source
    -> View More Comments
chunes@lemmy.world ⁨1⁩ ⁨month⁩ ago

Small quantities of poisoned training data can significantly damage a language model.

Source: trust me bro.

Nightshade tried the same thing and it never worked.

source
- ExLisper@lemmy.curiana.net ⁨1⁩ ⁨month⁩ ago
  Here’s your source: www.anthropic.com/research/small-samples-poison
  
  source
- homes@piefed.world ⁨1⁩ ⁨month⁩ ago
  Night shade did work on older models. Neural models adapted to prevent poisoning.
  
  This is a new approach.
  
  source
  - nullroot@lemmy.world ⁨1⁩ ⁨month⁩ ago
    Ye, nightshade was defeated by a blur and sharpen iirc lol. Still, was a good first step.
    
    source
Lembot_0006@programming.dev ⁨1⁩ ⁨month⁩ ago
Idiots: This new technology is still quite ineffective. Let’s sabotage it’s improvement!

Imbeciles: Yeah!

source
- Stern@lemmy.world ⁨1⁩ ⁨month⁩ ago
  Corpos: Don’t steal our stuff! That’s piracy!
  
  Also corpos: Your stuff? My stuff now.
  
  Bootlickers: Oh my god this shoe polish is delicious.
  
  source
  - Lembot_0006@programming.dev ⁨1⁩ ⁨month⁩ ago
    You should select something: whether you like the current copyright system or not. You can’t do both.
    
    source
    -> View More Comments
  - FauxLiving@lemmy.world ⁨1⁩ ⁨month⁩ ago
    Person: Says a thing
    
    Person 2, who disagrees with the thing: YOU’RE A BOOTLICKER!
    
    Super convincing. I’m sure you’re going to win people over to your position if you scream loud enough.
    
    source
    -> View More Comments
- Disillusionist@piefed.world ⁨1⁩ ⁨month⁩ ago
  AI companies could start, I don’t know- maybe asking for permission to scrape a website’s data for training? Or maybe try behaving more ethically in general? Perhaps then they might not risk people poisoning the data that they clearly didn’t agree to being used for training?
  
  source
  - Lembot_0006@programming.dev ⁨1⁩ ⁨month⁩ ago
    Why should they ask permission to read freely provided data? Nobody’s asking for any permission, but LLM trainers somehow should? And what do you want from them from an ethical standpoint?
    
    source
    -> View More Comments
- bookmeat@lemmynsfw.com ⁨1⁩ ⁨month⁩ ago
  This is a response to a SOCIAL problem, not a technical one.
  
  source
eru@mouse.chitanda.moe ⁨1⁩ ⁨month⁩ ago
i would imagine companies would just filter it out

need some more clever way of hiding it or allow it to be self hosted so that it has various urls

source
- GamingChairModel@lemmy.world ⁨1⁩ ⁨month⁩ ago
  If I am reading this correctly, anyone who wants to use this service can just configure their HTTP server to act as the man in the middle of the request, so that the crawler sees your URL but is retrieving poison fountain content from the poison fountain service.
  
  If so, that means the crawlers wouldn’t be able to filter by URL because the actual handler that responds to the HTTP request doesn’t ever see the canonical URL of the poison fountain.
  
  In other words, the handler is “self hosted” at its own URL while the stream itself comes from the same URL that the crawler never sees.
  
  source
- CileTheSane@lemmy.ca ⁨1⁩ ⁨month⁩ ago
  So it would be effective at preventing your site from being used as training data.
  
  source
FaceDeer@fedia.io ⁨1⁩ ⁨month⁩ ago
Doesn't work, but I guess if it makes people feel better I suppose they can waste their resources doing this.

Modern LLMs aren't trained on just whatever raw data can be scraped off the web any more. They're trained with synthetic data that's prepared by other LLMs and carefully crafted and curated. Folks are still thinking ChatGPT 3 is state of the art here.

source
- Disillusionist@piefed.world ⁨1⁩ ⁨month⁩ ago
  From what I’ve heard, the influx of AI data is one of the reasons actual human data is becoming increasingly sought after. AI training AI has the potential to become a sort of digital inbreeding that suffers in areas like originality and other ineffable human qualities that AI still hasn’t quite mastered.
  
  I’ve also heard that this particular approach to poisoning AI is newer and thought to be quite effective, though I can’t personally speak to its efficacy.
  
  source
  - BagOfHeavyStones@piefed.social ⁨1⁩ ⁨month⁩ ago
    Faults in replication? That can become cancer for humans. AI as well I guess.
    
    source
- Taldan@lemmy.world ⁨1⁩ ⁨month⁩ ago
  Let’s say I believe you. If that’s the case, why are AI companies still scraping everything?
  
  source
  - FaceDeer@fedia.io ⁨1⁩ ⁨month⁩ ago
    Raw materials to inform the LLMs constructing the synthetic data, most likely. If you want it to be up to date on the news, you need to give it that news.
    
    The point is not that the scraping doesn't happen, it's that the data is already being highly processed and filtered before it gets to the LLM training step. There's a ton of "poison" in that data naturally already. Early LLMs like GPT-3 just swallowed the poison and muddled on, but researchers have learned how much better LLMs can be when trained on cleaner data and so they already take steps to clean it up.
    
    source
- XLE@piefed.social ⁨1⁩ ⁨month⁩ ago
  Do you have any basis for this assumption, FaceDeer?
  
  Based on your pro-AI-leaning comments in this thread, I don’t think people should accept defeatist rhetoric at face value.
  
  source
  - FaceDeer@fedia.io ⁨1⁩ ⁨month⁩ ago
    A basic Google search for "synthetic data llm training" will give you lots of hits describing how the process goes these days.
    
    Take this as "defeatist" if you wish, as I said it doesn't really matter. In the early days of LLMs when ChatGPT first came out the strategy for training these things was to just dump as much raw data onto them as possible and hope quantity allowed the LLM to figure something out from it, but since then it's been learned that quality is better than quantity and so training data is far more carefully curated these days. Not because there's "poison" in it, just because it results in better LLMs. Filtering out poison will happen as a side effect.
    
    It's like trying to contaminate a city's water supply by peeing in the river upstream of the water treatment plant drawing from it. The water treatment plant is already dealing with all sorts of contaminants anyway.
    
    source
    -> View More Comments
- KeenFlame@feddit.nu ⁨1⁩ ⁨month⁩ ago
  Ai devalues datasets when it refines, many resources are aimed towards solving the degradation that occurs when ai trains on ai. Gradients become poor and quality follows
  
  source
  - FaceDeer@fedia.io ⁨1⁩ ⁨month⁩ ago
    You're thinking of "model decay", I take it? That's not really a thing in practice.
    
    source
    -> View More Comments
Blackmist@feddit.uk ⁨1⁩ ⁨month⁩ ago
With the amount of AI generated horseshit out there already, they’ve already pissed in the well.

source
nutsack@lemmy.dbzer0.com ⁨1⁩ ⁨month⁩ ago
I don’t think this is a good idea you’re going to in turn poison the knowledge of humanity more than the AI already is doing

source
- phoenixz@lemmy.ca ⁨1⁩ ⁨month⁩ ago
  Nah, AI will already do that automatically because any and all system loses something in inefficiencies. Same like if you put a theoretical 100 miles of gas worth in your tank that turns into 20 in practice because the combustion engine has an efficiency of 30ish%, you have air and tire resistance, etc etc.
  
  AI has the same for information, and what comes out is always a certain fraction of the 100% that went in
  
  Since poisoning the pool makes AI unreliable up to the point where it becomes useless, it has the potential to stop the AI madness. I’d be all for that.
  
  source
zr0@lemmy.dbzer0.com ⁨1⁩ ⁨month⁩ ago
This is just stupid^20

source