Reddit has a new AI training deal to sell user content

Submitted ⁨⁨1⁩ ⁨year⁩ ago⁩ by ⁨L4s@lemmy.world [bot]⁩ to ⁨technology@lemmy.world⁩

https://www.theverge.com/2024/2/17/24075670/reddit-ai-training-license-deal-user-content

Reddit has reportedly made a deal with an unnamed AI company to allow access to its platform’s content for the purposes of AI model training.

Comments

Lmaydev@programming.dev ⁨1⁩ ⁨year⁩ ago
I’d be very surprised if people weren’t already scraping Reddit for this.

- NoRodent@lemmy.world ⁨1⁩ ⁨year⁩ ago
  I mean, there’s /r/SubSimulatorGPT2 that’s been running for years… Although that one was at least hilarious to read because at that stage the AI was in the sweet spot of being simultaneously coherent while making total lapses in logic.
  
  - TexasDrunk@lemmy.world ⁨1⁩ ⁨year⁩ ago
    Don’t forget incredibly racist on multiple occasions.
    
- Verserk@lemmy.dbzer0.com ⁨1⁩ ⁨year⁩ ago
  That was the real reason for the API changes last year, apps just got caught in the crossfire.
  
  - fuckwit_mcbumcrumble@lemmy.world ⁨1⁩ ⁨year⁩ ago
    Yeah, I thought that was pretty well the established consensus on the thing. People questioning it confuses me, honestly.
    
- NeatNit@discuss.tchncs.de ⁨1⁩ ⁨year⁩ ago
  It’s all but guaranteed. Reminds me of this Computerphile video: youtu.be/WO2X3oZEJOA?t=874 TL;DW: there were “glitch tokens” in GPT (and therefore ChatGPT) which undeniably came from Reddit usernames.
  
  Note, there’s no proof that these Reddit usernames were in the training data (and there are even reasons to assume that they weren’t; watch the video for context), but there’s no doubt that OpenAI had already scraped Reddit data at some point prior to training, probably mixed in with all the rest of their text data. I see no reason to assume they completely removed all Reddit text before training.
  
ME5SENGER_24@lemmy.world ⁨1⁩ ⁨year⁩ ago
FUCK REDDIT! FUCK U/SPEZ! The Red-exit shall endure, VIVA LA LEMMY!!

- Boozilla@lemmy.world ⁨1⁩ ⁨year⁩ ago
  I bet the fuckers will use “deleted” data, too
  
  - General_Effort@lemmy.world ⁨1⁩ ⁨year⁩ ago
    Deleted? You mean made unscrapeable. It’s exclusive to Reddit licensees.
    
  - tinwhiskers@lemmy.world ⁨1⁩ ⁨year⁩ ago
    what about edited?
    
gwildors_gill_slits@lemmy.ca ⁨1⁩ ⁨year⁩ ago
Can’t wait for ChatGPT to call me good sir and tell me I win the internet.

- Buffalox@lemmy.world ⁨1⁩ ⁨year⁩ ago
  The best answer to your question is “deleted by user.”
  
comrade19@lemmy.world ⁨1⁩ ⁨year⁩ ago
Why is there nothing on reddit about this lol

- bobs_monkey@lemm.ee ⁨1⁩ ⁨year⁩ ago
  Mustn’t spook the ~~product~~ users
  
- FartsWithAnAccent@lemmy.world ⁨1⁩ ⁨year⁩ ago
  I’d be surprised if there wasn’t; I don’t think Spez and his cohorts are competent enough to completely suppress all information about it site-wide.
  
KingThrillgore@lemmy.ml ⁨1⁩ ⁨year⁩ ago
When spez took away API access, he basically shit on the social contract that offered a fair exchange of free access for the content we fed into reddit. There is no contract. There are no terms. If you use reddit now, you are giving away everything you are to be indexed and mangled by statistics.

We need legislation that tells scrapers what they can access.

- General_Effort@lemmy.world ⁨1⁩ ⁨year⁩ ago
  
  We need legislation that tells scrapers what they can access.
  
  What do you hope that would achieve?
  
  Because I can only see this as benefitting Reddit, Facebook, and the like, while screwing over smaller players.
  
VerseAndVermin@lemmy.world ⁨1⁩ ⁨year⁩ ago
Can someone more savvy explain why they couldn’t also scrape what we all say here?

- Crack0n7uesday@lemmy.world ⁨1⁩ ⁨year⁩ ago
  They can and do, but they want the training data to come from highly moderated sources; otherwise every AI chatbot would be spewing the most racist parts of 4chan, because people would train it that way as a joke.
  
  - bigkahuna1986@lemmy.ml ⁨1⁩ ⁨year⁩ ago
    Ah yes, the Microsoft Tay conundrum.
    
- Steak@lemmy.ca ⁨1⁩ ⁨year⁩ ago
  Dick dick pussy cunt cock dick pussy ass shit cunt shit motherfucker shit motherfucker ass tits cunt cock motherfucker shit ass tits motherfucker shit c’mon. Scrape that🔥
  
- Verserk@lemmy.dbzer0.com ⁨1⁩ ⁨year⁩ ago
  Anyone can; the difference is that Reddit holds the exclusive rights to user comments on its site, and it has chosen to sell them.
  
lvxferre@mander.xyz ⁨1⁩ ⁨year⁩ ago
For anyone looking for a gibberish generator to replace their Reddit content with, here’s one.

For automatic editing, I’m not sure what people can use nowadays; just before the APIcalypse I used Power Delete Suite, but I’m not sure if it still works, and I’m not creating a Reddit account just to test it out.

- greaprr@sh.itjust.works ⁨1⁩ ⁨year⁩ ago
  Not that I’m against telling Reddit to fuck off in no uncertain terms, but won’t providing this kind of poisoning to AI training just make it more resilient to exactly this kind of thing?
  
  - lvxferre@mander.xyz ⁨1⁩ ⁨year⁩ ago
    I don’t think so. It’s really hard to sort the poison out of the data unless you actually have enough reading comprehension to know that it’s gibberish: humans do, bots don’t. And even if they discard 80% of the poison, the remaining 20% is already screwing with the model.
    
    Reddit could prevent you from editing your posts/comments, but that would cause an uproar.
    
General_Effort@lemmy.world ⁨1⁩ ⁨year⁩ ago
They say it’s $60 million on an annualized basis. I wonder who’d pay that, given that you can probably scrape it for free.

Maybe it’s the AI act in the EU. That might cause trouble in that regard. The US is seeing a lot of rent-seeker PR, too, of course. That might cause some to hedge their bets.

Maybe some people had not realized that yet, but limiting fair use does not just benefit the traditional media corporations but also the likes of Reddit, Facebook, Apple, etc. Making “robots.txt” legally binding would only benefit the tech companies.

autotldr@lemmings.world [bot] ⁨1⁩ ⁨year⁩ ago
This is the best summary I could come up with:

Reddit will let “an unnamed large AI company” have access to its user-generated content platform in a new licensing deal, according to Bloomberg yesterday.

The deal, “worth about $60 million on an annualized basis,” the outlet writes, could still change as the company’s plans to go public are still in the works.

The news also follows an October story that Reddit had threatened to cut off Google and Bing’s search crawlers if it couldn’t make a training data deal with AI companies.

Last year, it successfully stonewalled its way out of the biggest protest in its history after changes to its third-party API access pricing caused developers of the most popular Reddit apps to shut down.

As Bloomberg writes, Reddit’s year-over-year revenue was up by 20 percent by the end of 2023, but it was still $200 million shy of a $1 billion target it had set two years prior.

The company was reportedly advised to seek a $5 billion valuation when it opens up for public investment, which is expected to happen in March.

The original article contains 346 words, the summary contains 175 words. Saved 49%. I’m a bot and I’m open source!
