I prefer “Habsburg AI”.
Comment on OpenAI strikes Reddit deal to train its AI on your posts
orca@orcas.enjoying.yachts 6 months ago
This is actually a thing. It’s called “Model Collapse”. You can read about it here.
noodlejetski@lemm.ee 6 months ago
FaceDeer@fedia.io 6 months ago
"Model collapse" can be easily avoided by keeping old human data with new synthetic data in the training set. The old archives of Reddit content from before there was AI are still around.
Ghostalmedia@lemmy.world 6 months ago
A model trained on jokes about bacon, narwhals, and rage comics.
FaceDeer@fedia.io 6 months ago
By "old archives" I mean everything from 2022 and earlier.
BakerBagel@midwest.social 6 months ago
But there were still bots making shit up back then. r/SubredditSimulator was pretty popular for a while, and repost and astroturfing bots were a problem for decades on Reddit.
Ghostalmedia@lemmy.world 6 months ago
I SAID RAGE COMICS
mint_tamas@lemmy.world 6 months ago
That paper is yet to be peer reviewed or released. I think you are jumping to conclusions with that statement. How much can you dilute the data until it breaks again?
barsoap@lemm.ee 6 months ago
Never doing either (release as in submit to journal) isn’t uncommon in maths, physics, and CS. Not to say that it won’t be released but it’s not a proper standard to measure papers by.
Quoth:
Emphasis on “finite upper bound, independent of the number of iterations” by doing nothing more than keeping the non-synthetic data around each time you ingest new synthetic data. This is an empirical study, so of course it’s not proof; you’ll have to wait for theorists to have their turn for that one. But it’s darn convincing and should henceforth be the null hypothesis.
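For a feel of the replace-versus-accumulate difference, here is a toy sketch. It is not taken from the paper: a one-dimensional Gaussian fit stands in for "the model", and `fit_gaussian`, `simulate`, and the sample sizes are names and numbers made up for illustration. Each generation fits a Gaussian to its training pool and then draws new synthetic samples from that fit. With `accumulate=False` the next generation sees only the newest synthetic samples; with `accumulate=True` the original data and every earlier batch stay in the pool.

```python
import math
import random


def fit_gaussian(data):
    """Return the maximum-likelihood mean and standard deviation of data."""
    mean = sum(data) / len(data)
    var = sum((x - mean) ** 2 for x in data) / len(data)
    return mean, math.sqrt(var)


def simulate(generations, n, accumulate, seed=0):
    """Fit a Gaussian, sample from the fit, and refit, generation after generation.

    accumulate=False: each generation trains only on the newest synthetic batch.
    accumulate=True: the original data and all earlier batches stay in the pool.
    Returns the fitted standard deviation after the last generation.
    """
    rng = random.Random(seed)
    # Hypothetical stand-in for the original human-written data.
    pool = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(generations):
        mean, std = fit_gaussian(pool)
        synthetic = [rng.gauss(mean, std) for _ in range(n)]
        if accumulate:
            pool.extend(synthetic)  # keep everything seen so far
        else:
            pool = synthetic  # throw the old data away
    return fit_gaussian(pool)[1]


if __name__ == "__main__":
    print("replace:   ", simulate(300, 20, accumulate=False))
    print("accumulate:", simulate(300, 20, accumulate=True))
```

In this toy setup the replace run's fitted spread typically shrinks toward zero over a few hundred generations, while the accumulate run stays close to the spread of the original sample. It says nothing about how much dilution a real language model tolerates; it only shows the mechanism being argued about.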
mint_tamas@lemmy.world 6 months ago
Peer review, for all its flaws, is a good minimum before a paper is worth taking seriously.
In your original comment you said that model collapse can be easily avoided with this technique, which is notably different from it being mitigated. I’m not saying that these findings are not useful, just that you are overselling them a bit with this wording.