In Cringe Video, OpenAI CTO Says She Doesn’t Know Where Sora’s Training Data Came From

⁨525⁩ ⁨likes⁩

Submitted ⁨⁨11⁩ ⁨months⁩ ago⁩ by ⁨ylai@lemmy.ml⁩ to ⁨technology@lemmy.world⁩

https://futurism.com/video-openai-cto-sora-training-data

Comments

redditReallySucks@lemmy.dbzer0.com ⁨11⁩ ⁨months⁩ ago
Image

I hope this is gonna become a new meme template

- driving_crooner@lemmy.eco.br ⁨11⁩ ⁨months⁩ ago
  She looks like she just talked to the waitress about a fake rule in eating nachos and got caught up by her date.
  
- whoisearth@lemmy.ca ⁨11⁩ ⁨months⁩ ago
  Coffeezilla had a video in his void where he plays this back a few times. It’s hilarious seeing the guilt without stating it.
  
Fisk400@feddit.nu ⁨11⁩ ⁨months⁩ ago
They know what they fed the thing. Not backing up their own training data would be insane. They are not insane, just thieves

- echodot@feddit.uk ⁨11⁩ ⁨months⁩ ago
  Everyone says this but the truth is copyright law has been unfit for purpose for well over 30 years now. When the laws were written, no one expected something like the internet to ever come along, and they certainly didn’t expect something like AI. We can’t just keep applying the same old copyright laws to new situations when they already don’t work.
  
  I’m sure they did illegally obtain the work but is that necessarily a bad thing? For example they’re not actually making that content available to anyone so if I pirate a movie and then only I watch it, I don’t think anyone would really think I should be arrested for that, so why is it unacceptable for them but fine for me?
  
  - oKtosiTe@lemmy.world ⁨11⁩ ⁨months⁩ ago
    
    if I pirate a movie and then only I watch it, I don’t think anyone would really think I should be arrested for that
    
    There are definitely people out there that think you should be arrested for that.
    
  - rottingleaf@lemmy.zip ⁨11⁩ ⁨months⁩ ago
    That is a bad thing if they want to be exempt from the law because they are doing a big, very important thing, and we shouldn’t.
    
    The copyright laws are shit, but applying them selectively is orders of magnitude worse.
    
  - A_Very_Big_Fan@lemmy.world ⁨11⁩ ⁨months⁩ ago
    
    if I pirate a movie and then only I watch it, I don’t think anyone would really think I should be arrested for that, so why is it unacceptable for them but fine for me?
    
    Because it’s more analogous to watching a video being broadcast outdoors in public, or looking at a mural someone painted on a wall, and letting it inform your creative works going forward. Not even recording it, just looking at it.
    
    As far as we know, they never pirated anything. What we do know is it was trained on data that literally anybody can go out and look at yourself and have it inform your own work. If they’re out here torrenting a bunch of movies they don’t own or aren’t licencing, then the argument against them has merit. But until then, I think all of this is a bunch of AI hysteria over some shit humans have been doing since the first human created a thing.
    
  - exanime@lemmy.today ⁨11⁩ ⁨months⁩ ago
    Because the actual comparison is that you stole ALL movies, started your own Netflix with them and are lining up to literally make billions by taking the jobs of millions of people, including those you stole from
    
  - GiveMemes@jlai.lu ⁨11⁩ ⁨months⁩ ago
    Ok but training an ai is not equivalent to watching a movie. It’s more like putting a game on one of those 300 games in one DS cartridges back in the day.
    
- VirtualOdour@sh.itjust.works ⁨11⁩ ⁨months⁩ ago
  That’s really not how it works, though. It’s a web crawler; they’re not going to download the whole internet.
  
  And the reason they don’t is that it would potentially be copyright infringement in some cases, whereas what they do legally isn’t (no matter how much people wish the law were set based on their emotions).
  
_haha_oh_wow_@sh.itjust.works ⁨11⁩ ⁨months⁩ ago
Gee, seems like something a CTO would know. I’m sure she’s not just lying, right?

- Bogasse@lemmy.ml ⁨11⁩ ⁨months⁩ ago
  And on the other hand, it is a very obvious question to expect. If you have something to hide, how in the world are you not prepared for this question!? 🤡
  
- VirtualOdour@sh.itjust.works ⁨11⁩ ⁨months⁩ ago
  It’s a question that is based on a purposeful misunderstanding of the technology; it’s like expecting a beekeeper to know each bee’s name and bedtime. Really, it’s like asking a bricklayer where each brick in the pile came from. He can tell you the batch, but he’s not going to know that this brick came from the fourth row of the sixth pallet, two from the left. There is no reason to remember that; it’s not important to anyone.
  
  They don’t log it because it would take huge amounts of resources and gain nothing.
  
  - zaphod@lemmy.ca ⁨11⁩ ⁨months⁩ ago
    What?
    
    Compiling quality datasets is enormously challenging and labour intensive. OpenAI absolutely knows the provenance of the data they train on, as it’s part of their secret sauce. And there’s no damn way their CTO won’t have a broad-strokes understanding of the origins of their datasets.
    
  - Guntrigger@feddit.ch ⁨11⁩ ⁨months⁩ ago
    [Citation needed]
    
- Hotzilla@sopuli.xyz ⁨11⁩ ⁨months⁩ ago
  To be fair, these datasets are one of their biggest competitive edges. But telling an interviewer “I cannot tell you” is not very nice, so you can take the American politician approach and say “I don’t know/remember”, which you can never be held accountable for.
  
phoneymouse@lemmy.world ⁨11⁩ ⁨months⁩ ago
There is no way in hell it isn’t copyrighted material.

- abhibeckert@lemmy.world ⁨11⁩ ⁨months⁩ ago
  Every video ever created is copyrighted.
  
  The question is — did they have a license? Did they even need a license? The law is unclear.
  
  - Kazumara@feddit.de ⁨11⁩ ⁨months⁩ ago
    Don’t downvote this guy. He’s mostly right. Creative works have copyright protections from the moment they are created. The relevant question is indeed whether they have the relevant permissions for their use, not whether the works had protections in the first place.
    
    Maybe some surveillance camera footage is not sufficiently creative to get protections, but that’s hardly going to be good for machine reinforcement learning.
    
  - iknowitwheniseeit@lemmynsfw.com ⁨11⁩ ⁨months⁩ ago
    There are definitely non copyrighted videos! Both old videos (all still black and white I think) and also things released into the public domain by copyright holders.
    
    But for sure that’s a very small subset of videos.
    
Buttons@programming.dev ⁨11⁩ ⁨months⁩ ago
If I were the reporter my next question would be:

“Do you feel that not knowing the most basic things about your product reflects on your competence as CTO?”

- ForgotAboutDre@lemmy.world ⁨11⁩ ⁨months⁩ ago
  Hilarious, but if the reporter asked this they would find it harder to get invites to events, which is a problem for journalists. Unless you’re very well regarded for your journalism, you can’t push powerful people without risking your career.
  
  - aniki@lemm.ee ⁨11⁩ ⁨months⁩ ago
    boofuckingwoo. Reporters are not supposed to be friends with the people they are writing about.
    
  - Abnorc@lemm.ee ⁨11⁩ ⁨months⁩ ago
    That, and the reporter is there to get information, not mess with and judge people. Asking that sort of question is really just an attack. We can leave it to commentators and ourselves to judge people.
    
- RatBin@lemmy.world ⁨11⁩ ⁨months⁩ ago
  Also about this line:
  
  Others, meanwhile, jumped to Murati’s defense, arguing that if you’ve ever published anything to the internet, you should be perfectly fine with AI companies gobbling it up.
  
  No, I am not fine. When I wrote that stuff and that research in old phpBB forums, I did not do it with the knowledge of a future machine learning system eating it up without my consent. I never gave consent for that despite it being publicly available, because this is a use that didn’t exist back then. Many other things are also publicly available, but some are copyrighted, on the same basis: you can publish and share content under conditions that are defined by the creator of the content. What’s that, when I use zlibrary I am evil for pirating content, but OpenAI can do it just fine due to their huge wallets? Guess what, this will eventually create a crisis of trust, a tragedy of the commons if you will, when enough AI-generated content makes up the bulk of your future Internet search! Do we even want this?
  
CosmoNova@lemmy.world ⁨11⁩ ⁨months⁩ ago
I almost want to believe they legitimately do not care they’re doing a gigantic data and labour heist, but the truth is they know exactly what they’re doing.

- laxe@lemmy.world ⁨11⁩ ⁨months⁩ ago
  Of course they know what they’re doing. Everybody knows this, how could they be the only ones that don’t?
  
- Bogasse@lemmy.ml ⁨11⁩ ⁨months⁩ ago
  Yeah, the fact that AI progress just relies on “we will make so much money that no lawsuit will meaningfully alter our growth” is really infuriating. The fact that the general audience apparently doesn’t care is even more infuriating.
  
- A_Very_Big_Fan@lemmy.world ⁨11⁩ ⁨months⁩ ago
  Look guys! I’m stealing from Tolkien!
  
  - toddestan@lemmy.world ⁨11⁩ ⁨months⁩ ago
    I’d say not really, Tolkien was a writer, not an artist.
    
    What you are doing is violating the trademark Middle-Earth Enterprises has on the Gandalf character.
    
  - Guntrigger@feddit.ch ⁨11⁩ ⁨months⁩ ago
    I don’t think anyone’s going to pay for your version of ChatGPT
    
stackPeek@lemmy.world ⁨11⁩ ⁨months⁩ ago
This tells you so much about what kind of company OpenAI is

- webghost0101@sopuli.xyz ⁨11⁩ ⁨months⁩ ago
  An Intelligence piracy company?
  
- jaemo@sh.itjust.works ⁨11⁩ ⁨months⁩ ago
  It also tells us how hypocritical we all are since absolutely every single one of us would make the same decisions they have if we were in their shoes. This shit was one bajillion percent inevitable; we are in a river and have been since we tilled soil with a plough in the Nile valley millennia ago.
  
  - adrian783@lemmy.world ⁨11⁩ ⁨months⁩ ago
    most of us would never be in their shoes because most of us are not sociopathic techbros
    
  - whoisearth@lemmy.ca ⁨11⁩ ⁨months⁩ ago
    Speak for yourself. Were I in their shoes no I would not. But then again my company wouldn’t be as big as theirs for that reason.
    
- wabafee@lemmy.world ⁨11⁩ ⁨months⁩ ago
  Half open or half close?
  
Bleach7297@lemmy.ca ⁨11⁩ ⁨months⁩ ago
Did they intentionally choose a picture where she looks like she’s morphing into Elon?

- rab@lemmy.ca ⁨11⁩ ⁨months⁩ ago
  I was thinking Mads Mikkelsen
  
  - billwashere@lemmy.world ⁨11⁩ ⁨months⁩ ago
    Well after just finishing Death Stranding, I can’t unsee that.
    
- HaywardT@lemmy.sdf.org ⁨11⁩ ⁨months⁩ ago
  I suspect so. It is a very slanted article.
  
anon_8675309@lemmy.world ⁨11⁩ ⁨months⁩ ago
CTO should definitely know this.

- ItsMeSpez@lemmy.world ⁨11⁩ ⁨months⁩ ago
  They do know this. They’re avoiding any legal exposure by being vague.
  
- blazeknave@lemmy.world ⁨11⁩ ⁨months⁩ ago
  I feel like at their scale, if there’s going to be a figure head marketable CTO, it’s going to be this company. If not, you’re right, and she’s lying lol
  
- turkishdelight@lemmy.ml ⁨11⁩ ⁨months⁩ ago
  Of course she knows it. She just doesn’t want to get sued.
  
andrew_bidlaw@sh.itjust.works ⁨11⁩ ⁨months⁩ ago
Funny she didn’t talk it out with lawyers before that. That’s a bad way to answer that.

- driving_crooner@lemmy.eco.br ⁨11⁩ ⁨months⁩ ago
  Or she talked and the lawyers told her to pretend ignorance.
  
  - QuaternionsRock@lemmy.world ⁨11⁩ ⁨months⁩ ago
    It probably means that they don’t scrape and preprocess training data in house. She knows they get it from a garden variety of underpaid contractors, but she doesn’t know the specific data sources beyond the stipulations of the contract (“publicly available or licensed”), and she probably doesn’t even know that for certain.
    
  - andrew_bidlaw@sh.itjust.works ⁨11⁩ ⁨months⁩ ago
    Maybe, but it sounds very weak.
    
TheObviousSolution@lemm.ee ⁨11⁩ ⁨months⁩ ago
Then wipe it out and start again once you have where your data is coming from sorted out. Are we acting like you haven’t built datacenters packed full of NVIDIA processors just for this sort of retraining? They are choosing to build AI without proper sourcing; that’s not an AI limitation.

IvanOverdrive@lemm.ee ⁨11⁩ ⁨months⁩ ago
REPORTER: Where does your data come from?

CTO: Bitch, are you trying to get me sued?

PanArab@lemmy.world ⁨11⁩ ⁨months⁩ ago
So plagiarism?

- HaywardT@lemmy.sdf.org ⁨11⁩ ⁨months⁩ ago
  I don’t think so. They aren’t reproducing the content.
  
  I think the equivalent is you reading this article, then answering questions about it.
  
  - A_Very_Big_Fan@lemmy.world ⁨11⁩ ⁨months⁩ ago
    Idk why this is such an unpopular opinion. I don’t need permission from an author to talk about their book, or permission from a singer to parody their song. I’ve never heard any good arguments for why it’s a crime to automate these things.
    
    I mean hell, we have an LLM bot in this comment section that took the article and spat 27% of it back out verbatim, yet nobody is pissing and moaning about it “stealing” the article.
    
  - myrrh@ttrpg.network ⁨11⁩ ⁨months⁩ ago
    …with the prevalence of clickbaity bottom-feeder news sites out there, i’ve learned to avoid TFAs and await user summaries instead…
    
  - Linkerbaan@lemmy.world ⁨11⁩ ⁨months⁩ ago
    Actually neural networks verbatim reproduce this kind of content when you ask the right question such as “finish this book” and the creator doesn’t censor it out well.
    
    It uses an encoded version of the source material to create “new” material.
    
PoliticallyIncorrect@lemmy.world ⁨11⁩ ⁨months⁩ ago
Watching a video or reading an article isn’t copyright infringement when a human does it, so why is it when an “AI” does it? I believe the copyright infringement is committed through the prompt, so by the user, not the tool.

- echo64@lemmy.world ⁨11⁩ ⁨months⁩ ago
  If you read an article, then copy parts of that article into a new article, that’s copyright infringement. Same with ais.
  
  - anlumo@lemmy.world ⁨11⁩ ⁨months⁩ ago
    Depends on how much is copied, if it’s a small amount it’s fair use.
    
- Drewelite@lemmynsfw.com ⁨11⁩ ⁨months⁩ ago
  This is what people fundamentally don’t understand about intelligence, artificial or otherwise. People feel like their intelligence is 100% “theirs”. While I certainly would advocate that a person owns their intelligence, It didn’t spawn from nothing.
  
  You’re standing on the shoulders of everyone that came before you. You take a prehistoric man or an alien that hasn’t had any of the same experiences you’ve had, they won’t be able to function in this world. It’s not because they are any dumber than you. It’s because you absorbed the hive mind of the society you live in. Everyone’s racing to slap their brand on stuff to copyright it to get ahead and carve out their space.
  
  “No you can’t tell that story, It’s mine.” “That art is so derivative.”
  
  But copyright was only meant to protect something for a short period in order to monetize it; to adapt the value of knowledge for our capital market. Our world can’t grow if all knowledge is owned forever and isn’t able to be used when even THINKING about new ideas.
  
  ANY VERSION OF INTELLIGENCE YOU WOULD WANT TO INTERACT WITH MUST CONSUME OUR KNOWLEDGE AND PRODUCE TRANSFORMATIONS OF IT.
  
  That’s all you do.
  
  Imagine how useless someone would be who’d never interacted with anything copyrighted, patented, or trademarked.
  
  - raspberriesareyummy@lemmy.world ⁨11⁩ ⁨months⁩ ago
    That’s not a very agreeable take. Just get rid of patents and copyrights altogether and your point dissolves into nothing. The core difference is that derivative works by humans can respect the right to privacy of original creators.
    
    Deep learning bullshit software however will just regurgitate creator’s contents, sometimes unrecognizable, but sometimes outright steal their likeness or individual style to create content that may be associated with the original creators.
    
    what you are in effect doing, is likening learning from the ideas of others to a deep learning “AI” using images for creating revenge porn, to give a drastic example.
    
  - rottingleaf@lemmy.zip ⁨11⁩ ⁨months⁩ ago
    Yes, so how come all these arguments were not popular before the current hype about text generators?
    
    Have some integrity.
    
- topinambour_rex@lemmy.world ⁨11⁩ ⁨months⁩ ago
  What is this human going to do with this reading? Are they going to produce something by using part of this book or this article?
  
  If yes, that’s copyright infringement.
  
- uninvitedguest@lemmy.ca ⁨11⁩ ⁨months⁩ ago
  When a school professor “prompts” you to write an essay and you, the “tool”, go consume copyrighted material and plagiarize it in the production of your essay, is the infringement made by the professor?
  
  - PoliticallyIncorrect@lemmy.world ⁨11⁩ ⁨months⁩ ago
    If you cite the sources and write it in your own words, I believe it isn’t. AFAIK “AI” already does that.
    
- Prandom_returns@lemm.ee ⁨11⁩ ⁨months⁩ ago
  Because it’s software.
  
  - Drewelite@lemmynsfw.com ⁨11⁩ ⁨months⁩ ago
    How do you expect people will create AI if it can’t do the things we do, when “doing the things we do” is the whole point?
    
autotldr@lemmings.world [bot] ⁨11⁩ ⁨months⁩ ago
This is the best summary I could come up with:

Mira Murati, OpenAI’s longtime chief technology officer, sat down with The Wall Street Journal’s Joanna Stern this week to discuss Sora, the company’s forthcoming video-generating AI.

It’s a bad look all around for OpenAI, which has drawn wide controversy — not to mention multiple copyright lawsuits, including one from The New York Times — for its data-scraping practices.

After the interview, Murati reportedly confirmed to the WSJ that Shutterstock videos were indeed included in Sora’s training set.

But when you consider the vastness of video content across the web, any clips available to OpenAI through Shutterstock are likely only a small drop in the Sora training data pond.

Others, meanwhile, jumped to Murati’s defense, arguing that if you’ve ever published anything to the internet, you should be perfectly fine with AI companies gobbling it up.

Whether Murati was keeping things close to the vest to avoid more copyright litigation or simply just didn’t know the answer, people have good reason to wonder where AI data — be it “publicly available and licensed” or not — is coming from.

The original article contains 667 words, the summary contains 178 words. Saved 73%. I’m a bot and I’m open source!

- A_Very_Big_Fan@lemmy.world ⁨11⁩ ⁨months⁩ ago
  Funny how we have all this pissing and moaning about stealing, yet nobody ever complains about this bot actually lifting entire articles and spitting them back out without ads or fluff. I guess it’s different when you find it useful, huh?
  
  I like the bot, but I mean y’all wanna talk about copyright violations? The argument against this bot is a hell of a lot more solid than just using data for training.
  
  - Guntrigger@feddit.ch ⁨11⁩ ⁨months⁩ ago
    Is this bot a closed system which is being used for profit? No, you know exactly what its source is (the single article it is condensing) and even has a handy link about how it is open source at the end of every single post.
    
whoisearth@lemmy.ca ⁨11⁩ ⁨months⁩ ago
So my work uses ChatGPT as well as all the other flavours. It’s getting really hard to stay quiet on all the moral quandaries being raised on how these companies are training their AI data.

I understand we all feel like we are on a speeding train that can’t be stopped or even slowed down, but this shit ain’t right. We really need to start forcing businesses to have a moral compass.

- RatBin@lemmy.world ⁨11⁩ ⁨months⁩ ago
  I spot a lot of people GPT-ing their way through personal notes and research. Where you used to see Evernote, Office, Word, or a note-taking app, you see a lot of GPT now. I feel weird about it.
  
dezmd@lemmy.world ⁨11⁩ ⁨months⁩ ago
LLM is just another iteration of Search. Search engines do the same thing. Do we outlaw search engines?

- AliasAKA@lemmy.world ⁨11⁩ ⁨months⁩ ago
  Sora is a generative video model, not exactly a large language model.
  
  But to answer your question: if all LLMs did was redirect you to where the content was hosted, then it would be a search engine. But instead they reproduce what someone else was hosting, which may include copyrighted material. So they’re fundamentally different from a simple search engine. They don’t direct you to the source; they reproduce a facsimile of the source material without acknowledging or directing you to it. Sora is similar. It produces video content, but it doesn’t redirect you to the similar video content that it is reproducing from. And we can argue about how close something needs to be to an existing artwork to count as a reproduction, but I think for AI models we should enforce citation models.
  
  - dezmd@lemmy.world ⁨11⁩ ⁨months⁩ ago
    How does a search engine know where to point you? It ingests all that data and processes it ‘locally’ on the search engine’s systems, using algorithms to organize the data for search. It’s effectively the same dataset.
    
    LLM is absolutely another iteration of Search, with natural language output for the same input data. Are you advocating against search engine data ingestion as not fair use and a copyright violation as well?
    
    You equate LLM to Intelligence, which it is not. It is an algorithmic search iteration with natural language responses, but that doesn’t sound as cool as AI. It’s neat, it’s useful, and yes, it should cite the sourcing details (upon request), but it’s not (yet?) a real intelligence and is equal to search in terms of fair use and copyright arguments.
    
  - HaywardT@lemmy.sdf.org ⁨11⁩ ⁨months⁩ ago
    I think the question of how close does it have to be is the real question.
    
    If I use similar lighting in my movie as was used in Citizen Kane do I owe a credit?
    
Gakomi@lemmy.world ⁨11⁩ ⁨months⁩ ago
Any company CEO does not know shit that goes on in the dev department so her answer does not surprise me, ask the Devs or the team leader in charge of the project. The CEO is only there to make sure the company makes money as he and the share holders only care about money!

Fedizen@lemmy.world ⁨11⁩ ⁨months⁩ ago
this is why code AND cloud services shouldn’t be copyrightable or licensable without some kind of transparency legislation to ensure people are honest.

turkishdelight@lemmy.ml ⁨11⁩ ⁨months⁩ ago
what’s wrong with her face?

ZILtoid1991@lemmy.world ⁨11⁩ ⁨months⁩ ago
I have a feeling that the training material involves cheese pizza…

RatBin@lemmy.world ⁨11⁩ ⁨months⁩ ago
Obviously nobody fully knows where so much training data comes from. They used web scraping tools like there was no tomorrow, and with that amount of information you can’t tell where all the training material comes from. Which doesn’t mean that the tool is unreliable, but that we don’t truly know why it’s that good, unless you can somehow access all the layers of the digital brains operating these machines; that isn’t doable in a closed-source model, so we can only speculate. This is what is called a black box, and we use it because we trust the output enough to do so. Knowing in detail the process behind each query would thus be taxing. Anyway… I’m starting to see more and more AI-generated content. YouTube is slowly but surely losing significance and importance, as I no longer search for information there, AI being one of the reasons for this.
