How Tech Giants Cut Corners to Harvest Data for A.I.

Submitted 11 months ago by ForgottenFlux@lemmy.world to technology@lemmy.world

https://www.nytimes.com/2024/04/06/technology/tech-giants-harvest-data-artificial-intelligence.html

If the linked article has a paywall, you can access this archived version instead: archive.ph/aeQhP

OpenAI, Google and Meta ignored corporate policies, altered their own rules and discussed skirting copyright law as they sought online information to train their newest artificial intelligence systems.


Comments


autotldr@lemmings.world [bot] 11 months ago
This is the best summary I could come up with:

At Meta, which owns Facebook and Instagram, managers, lawyers and engineers last year discussed buying the publishing house Simon & Schuster to procure long works, according to recordings of internal meetings obtained by The Times.

The companies’ actions illustrate how online information — news stories, fictional works, message board posts, Wikipedia articles, computer programs, photos, podcasts and movie clips — has increasingly become the lifeblood of the booming A.I.

Google and Meta, which have billions of users who produce search queries and social media posts every day, were largely limited by privacy laws and their own policies from drawing on much of that content for A.I.

They had mined the computer code repository GitHub, vacuumed up databases of chess moves and drawn on data describing high school tests and homework assignments from the website Quizlet.

Ahmad Al-Dahle, Meta’s vice president of generative A.I., told executives that his team had used almost every available English-language book, essay, poem and news article on the internet to develop a model, according to recordings of internal meetings, which were shared by an employee.

One employee recounted a separate discussion about copyrighted data with senior executives including Chris Cox, Meta’s chief product officer, and said no one in that meeting considered the ethics of using people’s creative works.

The original article contains 3,313 words, the summary contains 213 words. Saved 94%. I’m a bot and I’m open source!

- nahuse@sh.itjust.works 11 months ago
  This article is, just a little bit, about you, and those who came after you. That’s neat!
  
Rolando@lemmy.world 11 months ago
This is behind a paywall, so we’ll never know.

- TWeaK@lemm.ee 11 months ago
  
  If the linked article has a paywall, you can access this archived version instead: archive.ph/aeQhP
  
0xvalentin@lemmy.sdf.org 11 months ago
I wonder when Google is going to tap into users’ Gmail data (if they do not already). They must have trillions of English messages, and they have already filtered out the spam. Additionally, it’s hard to ever prove that they did it.

Maybe it doesn’t make for high quality data though, not sure…

- onion@feddit.de 11 months ago
  I mean, if you want the AI to be able to write emails and understand lingo like ‘attachment’, ‘subject’ and ‘bcc’, you probably have to feed it emails.
  
TWeaK@lemm.ee 11 months ago
On AI training itself on AI-produced content:

“As long as you can get over the synthetic data event horizon, where the model is smart enough to make good synthetic data, everything will be fine,” Mr. Altman said.
