Learning is not being able to reproduce a news article word for word.
Comment on The Irony of 'You Wouldn't Download a Car' Making a Comeback in AI Debates
Hackworth@lemmy.world 2 months ago
It’s called learning, and I wish people did more of it.
GiveMemes@jlai.lu 2 months ago
sugar_in_your_tea@sh.itjust.works 2 months ago
You don’t learn by memorizing and reproducing works; you learn by understanding the concepts in various works and producing new works that combine the ideas in those other works. AI doesn’t understand, and it has been shown to be able to reproduce works, so I think it’s fair to say that it’s doing a lot of “memorizing” and therefore plagiarism.
Hackworth@lemmy.world 2 months ago
Calling what attention transformers do memorization is wildly inaccurate.
sugar_in_your_tea@sh.itjust.works 2 months ago
Is it though? People memorize things very differently than computers do, but the actual mechanism of storage isn’t particularly important. What’s important is the net result. Whether it uses Bayesian networks (what we used in class for small-scale NLP), neural networks (what I assume LLMs use), or something else doesn’t particularly matter.
For example, a search engine typically only stores keywords and relationships, so there’s no way for it to reproduce an entire work (ignoring, of course, the “caching” features some search engines have). All it does is associate keywords with source material, so there’s a strong argument that it falls under fair use.
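To make the keyword-association idea concrete, here is a minimal sketch of an inverted index in Python. It is a toy under stated assumptions, not how any real search engine is built: the names `build_index` and `search` and the sample documents are hypothetical, and real engines store far more (positions, rankings, cached snippets).

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercase keyword to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every keyword in the query."""
    words = re.findall(r"[a-z0-9']+", query.lower())
    if not words:
        return set()
    # Start from the first keyword's documents, then intersect with the rest.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample documents, for illustration only.
docs = {
    "a": "The quick brown fox jumps over the lazy dog",
    "b": "A lazy afternoon with a quick nap",
}
index = build_index(docs)
```

Calling `search(index, "quick lazy")` returns both document ids, while `search(index, "fox")` returns only `"a"`. Note that this toy index keeps no word order or repetition counts, which is why the original text can’t be rebuilt from it alone.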
LLMs, on the other hand, process entire works and keep more than just keywords, and they store it in such a way that entire works can be recovered if coaxed. My understanding is that they break up words into something like a phoneme, and then reproductions do a similar break-up of queries as input to the neural network to produce an output, which is then reassembled into text. That’s my relatively naive understanding of how it all works, but again, that’s really not the point here. The point is that it uses a lot more of the work than the typical understanding of “fair use,” and if copyrighted works can be reproduced by it, then the copyrighted work is “stored” in some fashion, so it can be thought of as a really complex form of compression with tricky retrieval mechanisms. So in layman’s terms, it’s “memorizing” entire works in a way not entirely unlike a “mind palace”: to reproduce a given work, you need the right input to follow the right steps, and a slightly different input will lead to a very different output (i.e. maybe something with similar content, but no copyright violations).
What’s at issue isn’t whether the LLM is likely to reproduce entire works, but whether it can and does, which would mean it’s violating fair use standards.
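One rough way to check the “can and does” question for a given output is to look for the longest word-for-word run it shares with a source text. The sketch below uses Python’s standard `difflib`; the function name `longest_shared_run` and the sample strings are hypothetical, and the length of a shared run is only a crude signal, not a legal test for fair use.

```python
from difflib import SequenceMatcher


def longest_shared_run(source, output):
    """Return the longest run of consecutive words found in both texts."""
    a = source.lower().split()
    b = output.lower().split()
    # autojunk=False so frequent words are not silently ignored.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return " ".join(a[match.a:match.a + match.size])


# Hypothetical sample texts, for illustration only.
source = "it was the best of times it was the worst of times"
verbatim = "he said it was the worst of times indeed"
paraphrase = "the era was both wonderful and terrible"

print(longest_shared_run(source, verbatim))
print(len(longest_shared_run(source, paraphrase).split()))
```

With these samples, the output that quotes the source shares a six-word run with it, while the paraphrase shares only a single word.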