Comment on Do you think Google execs keep a secret un-enshittified version of their search engine and LLM?

partial_accumen@lemmy.world 1 day ago

It's also possible we've reached the limits of the training data.

This is my thinking too. I don't know how to solve the problem either, because datasets created after about 2022 are likely polluted with LLM output baked in. Even at 95% precision, that means 5% hallucinated content baked into the dataset. I can't imagine enough grounding is possible to mitigate that. As the years go on, the problem only gets worse, because more LLM output will be fed back in as training data.
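As a rough illustration of why the feedback loop compounds, here is a toy sketch (my own assumptions, not anything from the comment above): it assumes some fraction of each new training corpus is LLM output, and that output hallucinates 5% of the time on top of whatever contamination it already learned from. The `llm_share_per_gen` and `precision` numbers are made up for illustration.

```python
# Toy model (assumed parameters): how hallucinated content can compound
# when LLM output is fed back into later training sets.
def contaminated_fraction(generations: int,
                          llm_share_per_gen: float = 0.30,
                          precision: float = 0.95) -> float:
    """Estimated fraction of the corpus that is hallucinated after N feedback cycles."""
    bad = 0.0
    for _ in range(generations):
        # New LLM text inherits the existing contamination and adds fresh hallucinations.
        new_bad = min(1.0, bad + (1.0 - precision))
        # The next corpus mixes old data with the newly generated LLM text.
        bad = (1.0 - llm_share_per_gen) * bad + llm_share_per_gen * new_bad
    return bad

for gen in (1, 3, 5, 10):
    print(f"after {gen} generation(s): ~{contaminated_fraction(gen):.1%} contaminated")
```

Under these made-up numbers the contaminated share keeps climbing every generation, which is the "only gets worse" dynamic described above.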
