I wonder if OpenAI or any of the other firms have thought to put in any kind of stipulations about monitoring and moderating Reddit content to reduce AI-generated posts and lower the risk of model collapse.
Anybody who’s looked at Reddit in the past two years especially has seen the impact of AI pretty clearly. If I were running OpenAI, I wouldn’t want that crap contaminating my models.
orca@orcas.enjoying.yachts 5 months ago
This is actually a thing. It’s called “Model Collapse”. You can read about it here.
FaceDeer@fedia.io 5 months ago
"Model collapse" can be easily avoided by keeping old human data with new synthetic data in the training set. The old archives of Reddit content from before there was AI are still around.
Ghostalmedia@lemmy.world 5 months ago
A model trained on jokes about bacon, narwhals, and rage comics.
FaceDeer@fedia.io 5 months ago
By "old archives" I mean everything from 2022 and earlier.
mint_tamas@lemmy.world 5 months ago
That paper has yet to be peer reviewed or released. I think you’re jumping to conclusions with that statement. How much can you dilute the data before it breaks again?
barsoap@lemm.ee 5 months ago
Never doing either ("release" meaning submission to a journal) isn’t uncommon in maths, physics, and CS. That’s not to say it won’t be released, but it’s not a proper standard to measure papers by.
Quoth the paper: "finite upper bound, independent of the number of iterations". Emphasis on that phrase: the bound is achieved by doing nothing more than keeping the non-synthetic data around each time you ingest new synthetic data. This is an empirical study, so of course it’s not proof; you’ll have to wait for the theorists to have their turn for that one. But it’s darn convincing and should henceforth be the null hypothesis.
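This isn't the paper's actual setup (that used real models), but the standard toy version of the experiment makes the replace-vs-accumulate difference easy to see: fit a 1-D Gaussian, sample synthetic data from the fit, and either throw the old data away or keep it around.

```python
# Toy model-collapse experiment: "replace" discards all prior data each
# generation, "accumulate" keeps the real data plus every synthetic batch.
import random
import statistics

random.seed(0)
N = 50                                              # samples per generation
real = [random.gauss(0.0, 1.0) for _ in range(N)]   # "human" data, sigma = 1

def fit_and_sample(pool, n=N):
    """Fit mean/stdev to the pool, then draw n synthetic samples from it."""
    mu, sigma = statistics.fmean(pool), statistics.stdev(pool)
    return sigma, [random.gauss(mu, sigma) for _ in range(n)]

replace_pool, accumulate_pool = list(real), list(real)
for step in range(1, 201):
    sig_r, synth_r = fit_and_sample(replace_pool)
    replace_pool = synth_r                          # old data thrown away
    sig_a, synth_a = fit_and_sample(accumulate_pool)
    accumulate_pool += synth_a                      # real data never dropped
    if step % 50 == 0:
        print(f"step {step:3d}: replace sigma={sig_r:.3f}  "
              f"accumulate sigma={sig_a:.3f}")
```

Typically the "replace" sigma random-walks away from 1 and shrinks toward 0 (the fitted model forgets the tails of the distribution), while the "accumulate" sigma stays near 1 no matter how many generations you run, which is the bounded-error behavior the study reports.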
noodlejetski@lemm.ee 5 months ago
I prefer “Habsburg AI”.