Comment on Authors Are Furious After Finding Their Works on List of Books Used To Train AI
kromem@lemmy.world 1 year ago
That’s the thing – they weren’t.
The case has two prongs.
One is that training the AI on copyrighted material is somehow infringement, which is total BS and a dangerous path for the world to go down.
The other is that OpenAI illegally downloaded copyrighted material, which is pretty much an open-and-shut case: they didn’t buy copies of 100k books, they basically torrented them.
And because of ridiculous IP laws bought by industry lobbyists at the dawn of the digital age, statutory damages can run up to $150,000 per book if the infringement is willful, not $24.95.
Had they purchased them, these cases would very likely be headed for the dumpster heap.
That said, there’s a certain irony to Lemmy having pirate subs as one of the most popular while also generally being aggressively pro-enforcement on IP infringement.
BURN@lemmy.world 1 year ago
Training AI on copyrighted material is infringement and I’ll die on that hill. It’s the use of copyrighted material to create a commercial product. It doesn’t get any more clear-cut than that.
I know as an artist/musician/photographer I’d rather not put my creations out there at all if it means some corporation is going to be able to steal it.
kromem@lemmy.world 1 year ago
You can stand on whatever hill you like, but nonprofit versus commercial use is only one of the factors in determining fair use. Where your stance is going to have serious trouble is that the result of the training is highly transformed from the training data, has an entirely different purpose and character, and cannot even reproduce any of the works used in training in their entirety. And where a model can reproduce parts of a work, that is likely not even the direct result of training on the work itself, but of reinforcement from secondary uses and quotations of those same passages elsewhere in the training data.
And don’t worry. Within about a year or so (by the time any legal decision gets finalized or new legislation is passed) no one is going to care about ‘stealing’ your or anyone else’s creations, as training is almost certainly moving towards using primarily synthetic data and curated content creation to balance out edge cases.
Use of preexisting works was a stepping stone hack that acted like jumper cables starting the engine. Now that it’s running, there’s a rapidly diminishing need for the other engine.