Garbage in; Garbage out.
AI trained on AI garbage spits out AI garbage.
Submitted 3 months ago by ModerateImprovement@sh.itjust.works to technology@lemmy.world
Comments
Crazyslinkz@lemmy.world 3 months ago
_haha_oh_wow_@sh.itjust.works 3 months ago
Shit-fueled ouroboros
lemmeout@lemm.ee 3 months ago
You can’t explain it!
BluesF@lemmy.world 3 months ago
Recycle the garbage that comes out… Still more garbage out.
lvxferre@mander.xyz 3 months ago
Model degeneration is already a well-known phenomenon. The article explains well what’s going on, so I won’t go into details, but note how this happens because the model does not understand what it is outputting - it’s looking for patterns, not for the meaning conveyed by said patterns.
Frankly at this rate might as well go with a neuro-symbolic approach.
CeeBee_Eh@lemmy.world 3 months ago
The issue with your assertion is that people don’t actually work a similar way. Have you ever met someone who was clearly taught “garbage”?
lvxferre@mander.xyz 3 months ago
The issue with your assertion is that people don’t actually work a similar way.
I’m talking about LLMs, not about people.
PenisDuckCuck9001@lemmynsfw.com 3 months ago
I’m autistic. Sometimes I feel like an ai bot spewing out garbage in social situations.
Catoblepas@lemmy.blahaj.zone 3 months ago
AI making itself sick and worthless after flooding the internet with trash just gives me a warm glow.
tal@lemmy.today 3 months ago
Well, you’ve got a timestamped copy of much of the Web that existed up until latent-diffusion models at archive.org. That may not give you access to newer information, but it’s a pretty whopping big chunk of data to work with.
cordlesslamp@lemmy.today 3 months ago
Oh no, the AI are inbreeding.
kromem@lemmy.world 3 months ago
I’d be very wary of extrapolating too much from this paper.
Past research along these lines found that a mix of synthetic and organic data was better than organic alone. A caveat for all the research to date is that it uses shitty cheap models, whose synthetic data is significantly degraded compared to what SotA models produce, while other research has found notable improvements to smaller models trained on synthetic data from SotA models.
Basically this is only really saying that AI models of multiple types, with capabilities from a year or two ago, will collapse when recursively trained with no additional organic data.
It’s not representative of real world or emerging conditions.
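To make "recursively trained with no additional organic data" concrete, here is a toy sketch. It is not the paper's setup: "training" is just counting token frequencies and "generating" is sampling from those counts, and the corpus, the function name `resample_generations`, and all the numbers are made up for illustration. The only thing it shows is that rare tokens, once lost from a generation, can never come back if no fresh organic data is mixed in.

```python
import random
from collections import Counter

def resample_generations(corpus, generations, sample_size, seed=0):
    """Repeatedly 'train' on the previous generation's output only.

    'Training' is counting token frequencies; 'generating' is sampling
    from those frequencies. Returns the number of distinct tokens
    present in each generation (generation 0 is the organic corpus).
    """
    rng = random.Random(seed)
    data = list(corpus)
    support_sizes = [len(set(data))]
    for _ in range(generations):
        counts = Counter(data)
        tokens = list(counts)
        weights = [counts[t] for t in tokens]
        # The next generation sees only what the previous one produced.
        data = rng.choices(tokens, weights=weights, k=sample_size)
        support_sizes.append(len(set(data)))
    return support_sizes

# Hypothetical "organic" corpus: a few common tokens plus a long tail of rare ones.
organic = ["the"] * 50 + ["cat"] * 20 + ["fur"] * 10 + [f"rare{i}" for i in range(20)]
sizes = resample_generations(organic, generations=30, sample_size=100)
print(sizes)  # distinct-token count per generation; it can only stay flat or shrink
```

Mixing some of `organic` back into `data` on every pass is the kind of change that breaks this one-way loss, which is the gap between this closed loop and real-world training.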
Anarki_@lemmy.blahaj.zone 3 months ago
⢀⣠⣾⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀⠀⠀⠀⣠⣤⣶⣶ ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀⠀⠀⢰⣿⣿⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣧⣀⣀⣾⣿⣿⣿⣿ ⣿⣿⣿⣿⣿⡏⠉⠛⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡿⣿ ⣿⣿⣿⣿⣿⣿⠀⠀⠀⠈⠛⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠿⠛⠉⠁⠀⣿ ⣿⣿⣿⣿⣿⣿⣧⡀⠀⠀⠀⠀⠙⠿⠿⠿⠻⠿⠿⠟⠿⠛⠉⠀⠀⠀⠀⠀⣸⣿ ⣿⣿⣿⣿⣿⣿⣿⣷⣄⠀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣴⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⣿⠏⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠠⣴⣿⣿⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⡟⠀⠀⢰⣹⡆⠀⠀⠀⠀⠀⠀⣭⣷⠀⠀⠀⠸⣿⣿⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⠃⠀⠀⠈⠉⠀⠀⠤⠄⠀⠀⠀⠉⠁⠀⠀⠀⠀⢿⣿⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⢾⣿⣷⠀⠀⠀⠀⡠⠤⢄⠀⠀⠀⠠⣿⣿⣷⠀⢸⣿⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⡀⠉⠀⠀⠀⠀⠀⢄⠀⢀⠀⠀⠀⠀⠉⠉⠁⠀⠀⣿⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⣧⠀⠀⠀⠀⠀⠀⠀⠈⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢹⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⣿⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⣿⣿
KevonLooney@lemm.ee 3 months ago
provenance requires some way to filter the internet into human-generated and AI-generated content, which hasn’t been cracked yet
It doesn’t need to be filtered into human / AI content. It needs to be filtered into good (true) / bad (false) content. Or a “truth score” for each.
We don’t teach children to read by just handing them random tweets. We give them books that are made specifically for children. Our filtering mechanism for good / bad content is very robust for humans. Why can’t AI just read every piece of “classic literature”, famous speeches, popular books, good TV and movie scripts, etc?
lvxferre@mander.xyz 3 months ago
It doesn’t need to be filtered into human / AI content. It needs to be filtered into good (true) / bad (false) content. Or a “truth score” for each.
That isn’t enough because the model isn’t able to reason.
I’ll give you an example. Suppose that you feed the model with both sentences:
- Cats have fur.
- Birds have feathers.
Both sentences are true. And based on vocabulary of both, the model can output the following sentences:
- Cats have feathers.
- Birds have fur.
Both are false but the model doesn’t “know” it. All that it knows is that “have” is allowed to go after both “cats” and “birds”, and that both “feathers” and “fur” are allowed to go after “have”.
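A minimal sketch of the kind of model this example describes, assuming it only tracks which word is allowed to follow which (a bigram "allowed successor" table). The function name `allowed_next` is made up for illustration, and this is a toy, not a claim about how any real LLM is built.

```python
from collections import defaultdict

def allowed_next(sentences):
    """Map each word to the set of words seen directly after it."""
    nxt = defaultdict(set)
    for s in sentences:
        words = s.lower().rstrip(".").split()
        for a, b in zip(words, words[1:]):
            nxt[a].add(b)
    return nxt

training = ["Cats have fur.", "Birds have feathers."]
nxt = allowed_next(training)

# Every three-word chain the table permits, starting from either subject.
outputs = sorted(
    f"{a} {b} {c}"
    for a in ("cats", "birds")
    for b in nxt[a]
    for c in nxt[b]
)
print(outputs)
# ['birds have feathers', 'birds have fur', 'cats have feathers', 'cats have fur']
```

Two true sentences in, four sentences out, two of them false, because the table has no link between the subject and what comes after "have".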
KevonLooney@lemm.ee 3 months ago
It’s not just a predictive text program. That’s been around for decades. That’s a common misconception.
As I understand it, it uses statistics from the whole text to create new text. It would be very rare to output “cats have feathers” because that phrase doesn’t ever appear in the training data. The words “have feathers” never follow “cats”.
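A rough sketch of that idea, assuming a simple stand-in: count which word follows each pair of preceding words (a trigram table). Real LLMs condition on far longer context through learned weights rather than lookup counts, so the function names and the whole approach here are illustrative only. In this toy, "feathers" after "cats have" is not just rare, it has probability zero, because that sequence never occurs in the training sentences.

```python
from collections import Counter, defaultdict

def trigram_counts(sentences):
    """Count which word follows each pair of consecutive words."""
    counts = defaultdict(Counter)
    for s in sentences:
        words = s.lower().rstrip(".").split()
        for a, b, c in zip(words, words[1:], words[2:]):
            counts[(a, b)][c] += 1
    return counts

def next_word_probs(counts, context):
    """Relative frequency of each word seen after the two-word context."""
    seen = counts.get(context, Counter())
    total = sum(seen.values())
    return {w: n / total for w, n in seen.items()} if total else {}

training = ["Cats have fur.", "Birds have feathers."]
counts = trigram_counts(training)

cat_probs = next_word_probs(counts, ("cats", "have"))
print(cat_probs)                        # {'fur': 1.0}
print(cat_probs.get("feathers", 0.0))   # 0.0
```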
CeeBee_Eh@lemmy.world 3 months ago
Both sentences are true. And based on vocabulary of both, the model can output the following sentences:
- Cats have feathers.
- Birds have fur.
This is not how the models are trained or work.
Both are false but the model doesn’t “know” it. All that it knows is that “have” is allowed to go after both “cats” and “birds”, and that both “feathers” and “fur” are allowed to go after “have”.
Demonstrably false. This isn’t how LLMs are trained or built.
Just considering the contextual relationships between word embeddings that are created during training is evidence enough. Those relationships from the multi-vector fields are an emergent property that doesn’t exist in the dataset.
If you want a better understanding of what I just said, take a look at this Computerphile video from four years ago. And this came out before the LLM hype and before ChatGPT 3, which was the big leap in LLMs.
Zos_Kia@lemmynsfw.com 3 months ago
That’s what smaller models do, but it doesn’t yield great performance because there’s only so much stuff available. To get to GPT-4 levels you need a lot more data, and to break the next glass ceiling you’ll need even more.
KevonLooney@lemm.ee 3 months ago
Then these models are stupid. Humans don’t start as a blank slate. They have an inherent aptitude for language and communication. These models should start out with the basics of language, so they don’t have to learn it from the ground up. That’s the next step. Right now they’re just well-read idiots.
downpunxx@fedia.io 3 months ago
GIGO
SkaveRat@discuss.tchncs.de 3 months ago
People are already comparing older content with Low Background Steel, as it’s uncontaminated
FaceDeer@fedia.io 3 months ago
And they're overlooking that radionuclide contamination of steel actually isn't much of a problem any more, since the surge in background radionuclides caused by nuclear testing peaked in 1963 and has since gone down almost back to the original background level again.
I guess it's still a good analogy, though. People bring up Low Background Steel because they think radionuclide contamination is an unsolved problem (despite it having been basically solved), and they bring up "model decay" because they think it's an unsolved problem (despite it having been basically solved). It's like newspaper stories, everyone sees the big scary front page headline but nobody pays attention to the little block of text retracting it on page 8.
superminerJG@lemmy.world 3 months ago
News at 11.
FlashZordon@lemmy.world 3 months ago
The AI art is inbreeding.
TheReturnOfPEB@reddthat.com 3 months ago
certainly at least a downvote to free will
Andromxda@lemmy.dbzer0.com 3 months ago
Water is wet
cows_are_underrated@feddit.org 3 months ago
Is it wet or does it make other things wet?
sundray@lemmus.org 3 months ago
AI writing, scraped by AI, producing more AI writing…
So not “gray goo” exactly, but “gray slop”?
werefreeatlast@lemmy.world 3 months ago
Maybe we can use it to train the other AIs to help ourselves.
MonkderVierte@lemmy.ml 3 months ago
Woah, that was fast.
_haha_oh_wow_@sh.itjust.works 3 months ago
interdasting
Madrigal@lemmy.world 3 months ago
“On two occasions I have been asked, ‘Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?’ I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.” - Charles Babbage
bionicjoey@lemmy.ca 3 months ago
The business people adopting AI: “who cares what it’s trained on? It’s intelligent right? It’ll just sort through the garbage and magically come up with the right answers to everything”
RecluseRamble@lemmy.dbzer0.com 3 months ago
Not so hard to imagine given that these people have always seen technical systems as magic.
CookieOfFortune@lemmy.world 3 months ago
Of course modern UX design is very much based on getting the right answer with the wrong inputs (autocorrect, etc).
lennivelkant@discuss.tchncs.de 3 months ago
I believe Robustness was the term I learned years ago: the ability of a system to gracefully handle user error, make it easy to recover from or fix, clearly communicate what was wrong etc.
Of course, nothing is ever perfect and humans are very creative at fucking up, and a lot of companies don’t seem to take UX too seriously. Particularly when the devs get tunnel vision and forget about user error being a thing…