Those claiming AI training on copyrighted works is “theft” misunderstand key aspects of copyright law and AI technology. Copyright protects specific expressions of ideas, not the ideas themselves.
Sure.
When AI systems ingest copyrighted works, they’re extracting general patterns and concepts - the “Bob Dylan-ness” or “Hemingway-ness” - not copying specific text or images.
Not really. Sure, they take input and garble it up, and that is “transformative”, but so is a human watching a TV series on a pirate site, for example. Hell, even when it’s educational, that is treated as a copyright violation.
This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages.
Perhaps. (Not an AI expert). But, as the law currently stands, only living and breathing persons can be educated, so the “educational” fair use protection doesn’t stand.
The AI discards the original text, keeping only abstract representations in “vector space”. When generating new content, the AI isn’t recreating copyrighted works, but producing new expressions inspired by the concepts it’s learned.
It does and it doesn’t discard the original. It isn’t impossible to recreate the original, since all the data it gobbled up gets stored somewhere in some shape or form and can be faithfully recreated, at least judging by a few comments below and news reports. So AI can and does recreate (duplicate or distribute, perhaps) copyrighted works.
Besides, for a copyright violation, “substantial similarity” is needed, not one-for-one reproduction.
This is fundamentally different from copying a book or song.
Again, not really.
It’s more like the long-standing artistic tradition of being influenced by others’ work.
Sure. Except when it isn’t and the AI pumps out the original or something close enough to it.
The law has always recognized that ideas themselves can’t be owned - only particular expressions of them.
I’d be careful with the “always” part. There was a famous case involving Katy Perry, who was sued for copyright infringement over a short repeating musical phrase. The verdict against her was eventually overturned, but I do not doubt that some pretty wild cases have been upheld as copyright violations (see “patent troll”).
Moreover, there’s precedent for this kind of use being considered “transformative” and thus fair use. The Google Books project, which scanned millions of books to create a searchable index, was ruled legal despite protests from authors and publishers. AI training is arguably even more transformative.
The problem is that Google Books only lets you search for some phrase and have it pop up as being from source xy. It doesn’t reproduce the work (other than maybe the page the phrase was on). Well, it does have the capability, since the text is in the index somewhere, but there are checks in place to make sure that doesn’t happen, and those checks seem to be as yet unachieved in AI.
While it’s understandable that creators feel uneasy about this new technology, labeling it “theft” is both legally and technically inaccurate.
Yes. Just as labeling piracy as theft is.
We may need new ways to support and compensate creators in the AI age,
Yes, new legislation will be made to either let “Big AI” do as it pleases or prevent it from doing so. Or, as usual, it’ll be somewhere in between and vary from jurisdiction to jurisdiction.
However,
that doesn’t make the current use of copyrighted works for AI training illegal or unethical.
this doesn’t really stand. Sure, morals are debatable, and while I’d say it is more unethical than private piracy (so no distribution), since distribution and dissemination are involved, you do not seem to feel the same.
However, the law is clear. Private piracy (as in recording a song off the radio, recording a TV broadcast, screen recording a Netflix movie, etc.) is legal. As is digitizing books and lending the digital copy (as long as you have a physical copy, representing the legal “original”, that isn’t lent out at the same time). I think breaking DRM also isn’t illegal (but someone please correct me if I’m wrong).
The problem arises when the pirated content is copied and distributed in an uncontrolled manner, which AI seems to be capable of. If the AI reproduces output that is not even identical but merely “substantially similar”, its owner is just as liable for piracy as the hosts of “classic” pirated content distributed on the Web.
Obligatory IANAL. As far as the law goes, I focused on US law, since the default country on here is the US. Similar or different laws are on the books in other places, although most are in fact substantially similar. Also, what the legislators come up with will definitely vary from place to place, even more so than copyright law, since copyright law is partially harmonised (see the Berne Convention).
MagicShel@programming.dev 2 months ago
You made a lot of points here. Many I agree with, some I don’t, but I specifically want to address this because it seems to be such a common misconception.
AI stores original works like a dictionary does. All the words are there, but the order and meaning are completely gone. An original work is possible to recreate by randomly selecting words from the dictionary, but it’s unlikely.
The thing that makes AI useful is that it understands the patterns words are typically used in. It orders words in the right way far more often than random chance. It knows “It was the best of” has a lot of likely options for the next word, but if it selects “times” as the next word, it’s far more likely to continue with “it was the worst of times,” because that sequence of words is so ubiquitous due to references to the classic story. But over the course of following these word patterns, it will quickly glom onto a different pattern and create a wholly new work from the original “prompt.”
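For anyone who wants to see the "likely next word" idea in miniature, here is a rough Python sketch. It is only a toy: it counts which word follows which (a bigram table), whereas a real LLM uses a neural network over tokens and a much longer context. The corpus and the names `train_bigrams` and `next_word_probabilities` are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(texts):
    """Count, for every word, which words follow it in the training texts."""
    follows = defaultdict(Counter)
    for text in texts:
        words = text.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows


def next_word_probabilities(follows, word):
    """Turn the follower counts for `word` into probabilities."""
    counts = follows.get(word, Counter())
    total = sum(counts.values())
    if total == 0:
        return {}  # the word was never seen with a follower
    return {w: c / total for w, c in counts.items()}


# Hypothetical three-sentence "training set", just for the demo.
corpus = [
    "it was the best of times it was the worst of times",
    "it was the best of both worlds",
    "the best of luck to you",
]

model = train_bigrams(corpus)
probs = next_word_probabilities(model, "of")
print(probs)  # {'times': 0.5, 'both': 0.25, 'luck': 0.25}
```

Note that the table keeps follower statistics, not the sentences themselves. "times" is the most likely word after "of" only because that phrase shows up more often than the alternatives.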
There are only two cases in which an original work should be duplicated: either the training data is far too small and the model is overtrained on that particular work, or the work is the most derivative text imaginable lacking any flair or originality.
Adding more training data makes it less likely to recreate any original works.
I am aware of examples where it was claimed an LLM reproduced entire code functions, including the original comments. That is either a case of overtraining, or far too many people were already copying that code verbatim into their own, thus making that work heavily over-represented in the training data (same thing, but it was infringing developers who poisoned the data, not researchers using bad training data).
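The overtraining and over-representation cases can be shown with a toy model too. The sketch below is an assumption-heavy illustration, not how a real LLM is trained: it uses a word-pair (bigram) table and always picks the most frequent follower. The sentences and the names `train_bigrams` and `generate` are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigrams(texts):
    """Count, for every word, which words follow it in the training texts."""
    follows = defaultdict(Counter)
    for text in texts:
        words = text.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows


def generate(follows, start, max_words=20):
    """Greedy generation: always append the most frequent follower."""
    words = [start]
    while len(words) < max_words and words[-1] in follows:
        best, _ = follows[words[-1]].most_common(1)[0]
        words.append(best)
    return " ".join(words)


original = "call me ishmael some years ago never mind how long precisely"
other_texts = [
    "call me when you get home",
    "tell me when you can",
    "you can rest now",
]

# 1) Training data far too small: the model can only parrot the original.
tiny = generate(train_bigrams([original]), "call")

# 2) More varied data: other patterns win and the output drifts away.
broad = generate(train_bigrams([original] + other_texts), "call")

# 3) The original copied five times into the data: it dominates again.
skewed = generate(train_bigrams([original] * 5 + other_texts), "call")

print(tiny == original)    # True
print(broad)               # call me when you can rest now
print(skewed == original)  # True
```

In case 2 the output matches none of the training sentences, while in cases 1 and 3 the original comes back word for word. That mirrors the two failure modes described above: too little data, or one work repeated far too often in the data.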
Bottom line: when created with enough data, no original works are stored in any way that allows faithful reproduction other than by chance so random that it’s similar to rolling dice over a dictionary.
None of this means AI can do no wrong; I just don’t find the copyright claim compelling.