AI models collapse when trained on recursively generated data

252 likes

Submitted 7 months ago by floofloof@lemmy.ca to technology@lemmy.world

https://www.nature.com/articles/s41586-024-07566-y


Comments

BombOmOm@lemmy.world 7 months ago
Yep. It leads to a positive feedback loop. They just continue to self-reinforce whatever came out before.

And with increasing amounts of the internet being polluted with AI text output…

- Ensign_Crab@lemmy.world 7 months ago
  … AI inbreeding.
  
  - skillissuer@discuss.tchncs.de 7 months ago
    hapsburgGPT
    
  - Boozilla@lemmy.world 7 months ago
    We call it the GRRM model.
    
  - LilaOrchidee@feddit.org 7 months ago
    AInbreeding
    
- kevincox@lemmy.ml 7 months ago
  To be fair this doesn’t sound much different than your average human using the internet.
  
  - sp3tr4l@lemmy.zip 7 months ago
    2024, Reverse Turing Test Challenge:
    
    Can an LLM AI differentiate between human input and LLM AI input?
    
- MagicShel@programming.dev 7 months ago
  That seems so obviously predictable.
  
- Even_Adder@lemmy.dbzer0.com 7 months ago
  You have to pretty much intentionally give it enough synthetic data to wreck it. OpenAI and Anthropic train their models on generated data to improve them. As long as there’s supervision during training, this isn’t really a problem.
  
  openai.com/…/prover-verifier-games-improve-legibi…
  
  www.anthropic.com/research/claude-character
  
- Tobberone@lemm.ee 7 months ago
  Well… It's built on statistics, and statistical inference will return to the mean eventually. If all it ever gets to train on is closer and closer to the mean, there will be nothing left to work with. It will all be the average…
  
sp3tr4l@lemmy.zip 7 months ago
Holy shit are you telling me…

Garbage In…

= Garbage Out?

- FaceDeer@fedia.io 7 months ago
  You realize that those "billions of dollars" have actually resulted in a solution to this? "Model collapse" has been known about for a long time and further research figured out how to avoid it. Modern LLMs actually turn out better when they're trained on well-crafted and well-curated synthetic data.
  
  Honestly, everyone seems to assume that machine learning researchers are simpletons who've never used a photocopier before.
  
fubarx@lemmy.ml 7 months ago
No shit. People have known about the perils of feeding simulator output back in as input for eons. The variance drops off so you end up with zero new insights and a gradual worsening due to entropy.
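
As a rough illustration of the variance point (a toy sketch, not taken from the linked paper or from anyone in this thread): repeatedly fit a Gaussian to a small sample, then draw the next sample from that fit. The function name `simulate_collapse`, its parameters, and the default values are all made up for the example.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Refit a Gaussian on its own samples and track the fitted spread.

    Hypothetical toy model: each "generation" is trained only on data
    drawn from the previous generation's fitted distribution.
    Returns the fitted standard deviation after each generation,
    starting with the true value 1.0.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # the original "real data" distribution
    history = [sigma]
    for _ in range(generations):
        # Generate synthetic data from the current model...
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # ...and fit the next model to that synthetic data only
        # (maximum-likelihood estimates of mean and std).
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    print(f"initial std: {history[0]:.3f}, final std: {history[-1]:.3e}")
```

With only 10 samples per generation, the fitted spread gets multiplied by a random factor each round and trends toward zero, which is the "variance drops off" effect in miniature. Real LLM training is far more complicated than this.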

Zip2@feddit.uk 7 months ago
So it’s basically an AI prion disease?

- Llewellyn@lemm.ee 7 months ago
  No.
  
SlopppyEngineer@lemmy.world 7 months ago
Eventually an AI will be developed that can learn with much less data. In the end we don’t need to read the entire internet to get through our education. But that’s not going to be an LLM. No matter how much you tweak LLMs, they won’t get there. It’s like trying to tune a coal-fired, steam-powered car until you can compete in a Formula 1 race.

- conciselyverbose@sh.itjust.works 7 months ago
  Yeah, it’s entirely plausible that LLMs are a small part of the answer as basically the language center of the brain, but the brain is a hell of a lot more complex than that. The language center isn’t your whole brain, and is only loosely connected to actual decision making. It confabulates a lot.
  
  - SlopppyEngineer@lemmy.world 7 months ago
    OpenAI stumbled on something that worked and ran with it, and people started proclaiming it to be the answer to everything. The same happened with Deep Learning and every AI invention so far. It’s all just another stepping stone on the way.
    
- Even_Adder@lemmy.dbzer0.com 7 months ago
  It’s already happening. A quote from Andrej Karpathy:
  
  Turns out that LLMs learn a lot better and faster from educational content as well. This is partly because the average Common Crawl article (internet pages) is not of very high value and distracts the training, packing in too much irrelevant information. The average webpage on the internet is so random and terrible it’s not even clear how prior LLMs learn anything at all.
  
ConstipatedWatson@lemmy.world 7 months ago
You don’t say, Sherlock

YeetPics@mander.xyz 7 months ago
So do humans, if I’m being honest. Look at the RNC.

lemmy_get_my_coat@lemmy.world 7 months ago
Can’t wait

FartsWithAnAccent@fedia.io 7 months ago
No shit.
