Comment on Reddit has a new AI training deal to sell user content

Lmaydev@programming.dev ⁨1⁩ ⁨year⁩ ago

I’d be very surprised if people weren’t already scraping Reddit for this.

NoRodent@lemmy.world ⁨1⁩ ⁨year⁩ ago
I mean, there’s /r/SubSimulatorGPT2 that’s been running for years… Although that one was at least hilarious to read because at that stage the AI was in the sweet spot of being simultaneously coherent while making total lapses in logic.

- TexasDrunk@lemmy.world ⁨1⁩ ⁨year⁩ ago
  Don’t forget incredibly racist on multiple occasions.
  
  - bbkpr@lemmy.world ⁨1⁩ ⁨year⁩ ago
    The AI is what was fed into it 😂
    
Verserk@lemmy.dbzer0.com ⁨1⁩ ⁨year⁩ ago
That was the real reason for the API changes last year; apps just got caught in the crossfire.

- fuckwit_mcbumcrumble@lemmy.world ⁨1⁩ ⁨year⁩ ago
  Yeah, I thought that was pretty much the established consensus on the thing. People questioning it confuses me, honestly.
  
NeatNit@discuss.tchncs.de ⁨1⁩ ⁨year⁩ ago
It’s all but guaranteed. Reminds me of this Computerphile video: youtu.be/WO2X3oZEJOA?t=874 TL;DW: there were “glitch tokens” in GPT (and therefore ChatGPT), which undeniably came from Reddit usernames.

Note, there’s no proof that these Reddit usernames were in the training data (and there are even reasons to assume that they weren’t, watch the video for context), but there’s no doubt that OpenAI had already scraped Reddit data at some point prior to training, probably mixed in with all the rest of their text data. I see no reason to assume they completely removed all Reddit text before training.
