Comment on The Irony of 'You Wouldn't Download a Car' Making a Comeback in AI Debates
GiveMemes@jlai.lu 3 months ago
Except that, again, as is literally written in the comment you’re directly replying to, it has been shown that AI can reproduce copyrightable works word for word, showing that it objectively and necessarily is storing particular creative in a particularly identifiable manner, whether or not that manner is yet known to humans.
FatCrab@lemmy.one 3 months ago
No, it isn’t storing that information in that sequence. What is happening is that it is overly encoding those particular sequential relationships along some arbitrary but tightly mapped semantic concepts represented by dimensions in a massive vector space. It is storing copies of the information in the way that inadvertent copying of music might be based on “memorized” music listened to by the infringing artist in the past.
GiveMemes@jlai.lu 3 months ago
Not what I said. I used the exact language the above commenter used because it was specific and accurate.
FatCrab@lemmy.one 3 months ago
Yes, inadvertent copying is still copying, but it would be copying in the output and is not evidence of copying happening in the creation of the model. That was why I used the music example, because it is rather probative of where there could be grounds for copyright infringement related to these model architectures. This may not seem an important distinction, but it has significant consequences on who is ultimately liable and how.
Hackworth@lemmy.world 3 months ago
It’s called learning, and I wish people did more of it.
sugar_in_your_tea@sh.itjust.works 3 months ago
You don’t learn by memorizing and reproducing works, you learn by understanding the concepts in various works and producing new works that are combinations of the ideas in those other works. AI doesn’t understand, and it has been shown to be able to reproduce works, so I think it’s fair to say that it’s doing a lot of “memorizing” and therefore plagiarism.
Hackworth@lemmy.world 3 months ago
Calling what attention transformers do memorization is wildly inaccurate.
sugar_in_your_tea@sh.itjust.works 3 months ago
Is it though? People memorize things very differently than computers do, but the actual mechanism of storage isn’t particularly important. What’s important is the net result. Whether it uses Bayesian networks (what we used in class for small-scale NLP), neural networks (what I assume LLMs use), or something else doesn’t particularly matter.
For example, a search engine typically only stores keywords and relationships, so there’s no way for it to reproduce an entire work (ignoring, of course, the “caching” features some search engines have). All it does is associate keywords with source material, so there’s a strong argument that it falls under fair use.
LLMs, on the other hand, process entire works and keep more than just keywords, and they store it in such a way that entire works can be recovered if coaxed. My understanding is that they break up words into something like phonemes, and then reproductions do a similar break-up of queries as input to the neural network to produce an output, which is then reassembled into text. That’s my relatively naive understanding of how it all works, and again, that’s really not the point here.
The point is that it uses a lot more of the work than the typical understanding of “fair use,” and if copyrighted works can be reproduced by it, then the copyrighted work is “stored” in some fashion, so it can be thought of as a really complex form of compression, with tricky retrieval mechanisms. So in layman’s terms, it’s “memorizing” entire works in a way not entirely unlike a “mind palace”: to reproduce a given work, you need the right input to follow the right steps, but a slightly different input will lead to a very different output (i.e. maybe something with similar content, but no copyright violations).
What’s at issue isn’t whether the LLM is likely to reproduce entire works, but whether it can and does, which would mean it’s violating fair use standards.
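As a toy illustration of the contrast drawn in the comment above (keyword storage versus storage from which a work can be coaxed back out), here is a minimal Python sketch. It is not how any real search engine or LLM is implemented; the sample text and the names `build_keyword_index`, `build_next_word_table`, and `generate` are all hypothetical, chosen only for this example. The keyword index throws away word order, while the next-word table, although it never holds the text as one string, can regenerate it from the right prompt.

```python
from collections import defaultdict

# Hypothetical sample "work" used only for this illustration.
TEXT = "the quick brown fox jumps over the lazy dog"


def build_keyword_index(doc_id, text):
    """Map each distinct word to the documents containing it.

    Word order is discarded, so the text cannot be rebuilt from the index.
    """
    index = defaultdict(set)
    for word in text.split():
        index[word].add(doc_id)
    return index


def build_next_word_table(text, context=2):
    """Map each run of `context` consecutive words to the word that followed it."""
    words = text.split()
    table = {}
    for i in range(len(words) - context):
        table[tuple(words[i:i + context])] = words[i + context]
    return table


def generate(table, prompt, context=2, max_words=50):
    """Extend the prompt one word at a time until no continuation is known."""
    out = prompt.split()
    while len(out) < max_words:
        next_word = table.get(tuple(out[-context:]))
        if next_word is None:
            break
        out.append(next_word)
    return " ".join(out)


if __name__ == "__main__":
    index = build_keyword_index("doc1", TEXT)
    table = build_next_word_table(TEXT)

    print(sorted(index))                   # distinct words only, no order
    print(generate(table, "the quick"))    # the right prompt
    print(generate(table, "the lazy"))     # a slightly different prompt
```

With the right two-word prompt the table walks back through the whole sentence, while a different prompt yields only a fragment. Whether that kind of recoverability counts as "storing" or "copying" in a legal sense is exactly what the thread is arguing about; the sketch only shows the mechanical difference.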
GiveMemes@jlai.lu 3 months ago
Learning is not being able to reproduce a news article word for word.