Speaking for LLMs: given that they operate on a next-token basis, there will always be some statistical likelihood of spitting out original training data, and that can’t be avoided. The usual counter-argument is that, in theory, the odds of a particular piece of training data coming back out intact for more than a handful of words should be extremely low.
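To put rough numbers on that counter-argument, here is a back-of-the-envelope sketch. It assumes every token in a passage is reproduced independently with the same probability, which is a simplification of mine and not how a real model behaves; `verbatim_probability` is a made-up helper name.

```python
def verbatim_probability(per_token_prob: float, length: int) -> float:
    """Chance of emitting `length` specific tokens in a row.

    Assumption: each token is reproduced independently with the same
    probability. Real models condition on context, so this is only a toy.
    """
    if not 0.0 <= per_token_prob <= 1.0:
        raise ValueError("per_token_prob must be between 0 and 1")
    if length < 0:
        raise ValueError("length must be non-negative")
    return per_token_prob ** length


if __name__ == "__main__":
    # Compare a few per-token probabilities across passage lengths.
    for p in (0.5, 0.9, 0.99):
        for n in (5, 50, 500):
            print(f"p={p:<4} tokens={n:<3} -> {verbatim_probability(p, n):.3g}")
```

Under this toy model the odds collapse quickly with length unless the per-token probability is very close to 1, which is roughly what memorised text would look like.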
Of course, in this case Google’s researchers took advantage of the repeat-discouragement mechanism to make that unlikely outcome occur reliably.
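For anyone wondering what "repeat discouragement" might look like, below is a minimal sketch of one common form, a repetition penalty applied to logits before sampling. Whether ChatGPT uses anything like this is an assumption on my part; the function names, the toy three-token vocabulary and the penalty value are all hypothetical.

```python
import math


def apply_repetition_penalty(logits, generated_ids, penalty=1.3):
    """Lower the score of every token that has already been generated.

    Positive logits are divided by the penalty and negative ones are
    multiplied by it, so a penalised token always becomes less likely.
    Each distinct token is penalised once, however often it appeared.
    """
    adjusted = list(logits)
    for token_id in set(generated_ids):
        if adjusted[token_id] > 0:
            adjusted[token_id] /= penalty
        else:
            adjusted[token_id] *= penalty
    return adjusted


def softmax(logits):
    """Turn logits into probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary of three tokens; token 0 is the word being repeated.
logits = [4.0, 1.0, 0.5]
before = softmax(logits)
after = softmax(apply_repetition_penalty(logits, generated_ids=[0, 0, 0]))
print(f"P(repeat) before penalty: {before[0]:.3f}, after: {after[0]:.3f}")
```

The point is only that a penalty like this pushes probability away from the repeated word and onto whatever else the model scores highly.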
sukhmel@programming.dev 11 months ago
It’s not copied as is; the thing is a bit more complicated, as was already pointed out.
TWeaK@lemm.ee 11 months ago
But the thing is the law has already established this with people and their memories. You might genuinely not realise you’re plagiarising, but what matters is the similarity of the work produced.
ChatGPT has copied the data into its training database, then trained off that database, then it runs “independently” of that database - which is how they vaguely argue fair use under the research exemption.
However, if ChatGPT can “remember” its training data and recompile significant portions of it in certain circumstances, then it must be guilty of plagiarism and copyright infringement.