AI companies claim their tools couldn’t exist without training on copyrighted material. It turns out they could; it just takes more work. To prove it, AI researchers trained a model on a dataset that uses only public domain and openly licensed material.
What makes it difficult is curating the data, but once the data has been curated, in principle everyone can use it without having to go through the painful part. So the whole “we have to violate copyright and steal intellectual property” argument is (as everybody already knew) total BS.
the LLM’s dataset uses only public domain and openly licensed material.
I’m curious about the specifics of all this. Probably the best-known “openly licensed” licenses (aside from those intended specifically for software) are the Creative Commons family, nearly all of which require attribution (CC0 is the exception). So then the question becomes “if you’ve used any of my CC-licensed content in training this model, am I attributed somewhere?” If so, surely the list is extremely long. Or maybe Creative Commons wasn’t “openly” licensed enough and they excluded all CC-licensed content from the training set.
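For what it’s worth, the bookkeeping itself wouldn’t be hard if every training document carried license metadata. Here’s a rough sketch of the idea. The field names, license labels, and sample documents are all made up for illustration; I have no idea how the actual dataset tracks this.

```python
from dataclasses import dataclass

# Hypothetical license labels; the real dataset's labels may differ.
NO_ATTRIBUTION_NEEDED = {"public-domain", "CC0-1.0"}
ATTRIBUTION_REQUIRED = {"CC-BY-4.0", "CC-BY-SA-4.0"}
ALLOWED = NO_ATTRIBUTION_NEEDED | ATTRIBUTION_REQUIRED


@dataclass(frozen=True)
class Doc:
    title: str
    author: str
    license: str


def curate(docs):
    """Keep only documents whose license is on the allow-list."""
    return [d for d in docs if d.license in ALLOWED]


def attribution_list(docs):
    """Return sorted, de-duplicated credit lines for licenses that require attribution."""
    credits = {
        f"{d.title} by {d.author} ({d.license})"
        for d in docs
        if d.license in ATTRIBUTION_REQUIRED
    }
    return sorted(credits)


# Made-up sample corpus, purely for illustration.
corpus = [
    Doc("Old Novel", "A. Writer", "public-domain"),
    Doc("Blog Post", "B. Blogger", "CC-BY-4.0"),
    Doc("News Article", "C. Reporter", "all-rights-reserved"),
]
kept = curate(corpus)
credits = attribution_list(kept)
```

Here `kept` drops the all-rights-reserved article, and `credits` only lists the CC-BY item, since the public-domain one doesn’t need a credit line. The hard part is getting trustworthy license metadata in the first place, not generating the list.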
Also, the public domain is strongly biased toward very old content. You’d think a lot of the answers you got from that LLM would be based on very outdated information. Maybe they limited it to recent public domain material, or at least weighted the training data somehow to make it favor recent material.
But then the article also says:
It performed about as well as Meta’s similarly sized Llama 2-7B from 2023.
On top of all this, I have to say that the LLM sphere really is just scams piled on top of scams, so it’s fairly probable that either it doesn’t perform anywhere near as well as Llama 2-7B and they’re just lying, or Llama 2-7B (and indeed every other LLM) is just total shit too.
spankmonkey@lemmy.world 1 day ago
Curated data should improve the results as well. Just jamming in all the trash data is why the models they keep jamming into everything have such trash results.