Comment on The Irony of 'You Wouldn't Download a Car' Making a Comeback in AI Debates
FatCrab@lemmy.one 2 months ago
Training data IS a massive industry already. You don’t see it because you probably don’t work in a field directly dealing with it. I work in medtech and millions and millions of dollars are spent acquiring training data every year. Should some new unique IP right be found on using otherwise legally rendered data to train AI, it is almost certainly going to be contracted away to hosting platforms via totally sound ToS and then further monetized such that only large and well-funded corporate entities can utilize it.
Eccitaze@yiffit.net 2 months ago
“unique new IP right?” Bruh you’re talking about basic fucking intellectual property law. Just because someone posts something publicly on the internet doesn’t mean that it can be used for whatever anybody likes. This is so well established that every major art gallery and social media website has a clause in their terms of service stating that you are granting them a license to redistribute that content. And most websites also explicitly state that when you upload your work to their site, you still retain your copyright in that work.
For example (emphasis mine):
FurAffinity:
Inkbunny:
DeviantArt:
e621:
Xitter:
Facebook:
I could go on, but I think I’ve made my point very clear: Every social media website and art gallery is built on an assumption that the person uploading art A) retains the copyright over the items they upload, B) that other people and organizations have NO rights to copyrighted works unless explicitly stated otherwise, and C) that 3rd parties accessing this material do not have any rights to uploaded works, since they never negotiated a license to use these works.
FatCrab@lemmy.one 2 months ago
You are misunderstanding what I’m getting at and, unfortunately, no, this isn’t just straightforward copyright law. The training content does not need to be copied. It isn’t saved in a database somewhere as part of the training (downloading pirated texts is a whole other issue, completely removed from the inherent processes of training a model); rather, relationships are extracted from the material, however it is presented. So the copyright extends to the right of displaying the material in the first place. If your initial display of or access to the training content is non-infringing, the mere extraction of relationships between components is not itself making a copy nor is it making a derivative work in any way we haven’t historically considered it. Effectively, it’s the difference between looking at material and making intensive notes on how different parts of it relate to each other, and looking at material and reproducing as much of it as possible for your own records.
Eccitaze@yiffit.net 2 months ago
FFS, the issue is not that the AI model “copies” the copyrighted works when it trains on them–I agree that after an AI model is trained, it does not meaningfully retain the copyrighted work. The problem is that the reproduction of the copyrighted work–i.e. downloading the work to the computer, and then using that reproduction as part of AI model training–is being done for a commercial purpose that infringes copyright.
If I went to DeviantArt and downloaded a random piece of art to my hard drive for my own personal enjoyment, that is a non-infringing reproduction. If I then took that same piece of art and uploaded it to a service that prints it on a T-shirt, the act of uploading it to the T-shirt printing service’s server would be infringing, since it is no longer a reproduction for personal enjoyment but an unlawful reproduction of copyrighted material for a commercial purpose. Similarly, if I downloaded a piece of art and used it to print my own T-shirts for sale, using all my own computers and equipment, that would also be infringing. This is straightforward, non-controversial copyright law.
The exact same logic applies to AI training. You can try to camouflage the infringement with flowery language like “mere extraction of relationships between components,” but the purpose and intent behind AI companies reproducing copyrighted works via web scraping and downloading copyrighted data to their servers is to build and provide a commercial, for-profit service that is designed to replace the people whose work is being infringed. Full stop.
FatCrab@lemmy.one 2 months ago
No, this is mostly incorrect, sorry. The commercial aspect of the reproduction is not relevant to whether it is an infringement; it is simply a factor in damages and in a Fair Use defense (an affirmative defense that presupposes infringement).
What you are getting at when it applies to this particular type of AI is effectively whether it would be a fair use, presupposing there is copying amounting to copyright infringement. And what I am saying is that, ignoring certain stupid behavior like torrenting a shit ton of text to keep a local store of training data, there is no copying happening as a matter of necessity. There may be copying as a matter of stupidity, but it isn’t necessary to the way the technology works.
Now, I know, you’re raging and swearing right now because you think that downloading the data into cache constitutes an unlawful copying, but it presumably does not if it is accessed like any other content on the internet. Intent is not part of what makes that a lawful or unlawful copying, and once a lawful distribution is made, principles of exhaustion begin to kick in and we start getting into really nuanced areas of IP law that I don’t feel like delving into with my thumbs. Ultimately, the point is that it isn’t “basic copyright law.” But if intent is determinative of whether there is copying in the first place, how does that jibe with an actor not making copies for themselves but rather accessing retained data in a third party’s cache after they grab the data for noncommercial purposes? Also, how does that make sense if the model is being trained for purely research purposes? And then perhaps that model is leveraged commercially after development? Your analysis, assuming it’s correct arguendo, leaves far too many outstanding substantive issues to be the ruling approach.