They are. They record the data, stealing it. They search it, and reprint it (in whole or in part) upon request.
They search the data-space of what they’re trained on (our content, the content of human beings), and reproduce statistically defined elements of it.
They’re search engines that have stolen what they’re trained on, and reproduce it as “results”.
Searching and reproducing content they’ve already recorded is absolutely part of what they are.
UnderpantsWeevil@lemmy.world 1 day ago
The basic graphing technology used by AI is the same as that pioneered by AltaVista and optimized by Google years later. We’ve added a layer of abstraction through user I/O, such that you get a formalized text response encapsulating results rather than a series of links containing related search terms. But the methodology used to harvest, hash, and sort results is still all rooted in graph theory.
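As a rough illustration of the graph-based ranking idea this comment is gesturing at, here is a minimal power-iteration PageRank sketch over a made-up three-page link graph. The page names and the 0.85 damping factor are illustrative placeholders, not AltaVista’s or Google’s actual implementation:

```python
# Toy PageRank by power iteration over a hypothetical link graph.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
damping = 0.85
rank = {node: 1.0 / len(links) for node in links}

for _ in range(50):  # iterate until ranks stabilize
    new_rank = {node: (1 - damping) / len(links) for node in links}
    for node, outgoing in links.items():
        share = damping * rank[node] / len(outgoing)
        for target in outgoing:
            new_rank[target] += share
    rank = new_rank

# "c" is linked by both "a" and "b", so it ends up with the highest rank.
```

A real engine works at vastly larger scale and adds many other signals, but the core idea is the same: rank pages by the structure of the link graph rather than by content alone.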
MartianSands@sh.itjust.works 1 day ago
That simply isn’t true. There’s nothing in common between an LLM and a search engine, except insofar as the people developing the LLM had access to search engines, and may have used them while gathering training data.
DarkCloud@lemmy.world 1 day ago
“data gathering” and “training data” is just what they have you calling it.
It’s not data gathering, it’s stealing. It’s not training data, it’s our original work.
MartianSands@sh.itjust.works 1 day ago
You’re putting words in my mouth, and inventing arguments I never made.
I didn’t say anything about whether the training data is stolen or not. I also didn’t say a single word about intelligence, or originality.
I haven’t been tricked into using one piece of language over another, I’m a software engineer and know enough about how these systems actually work to reach my own conclusions.
There is not a database tucked away in the LLM anywhere which you could search through and find the phrases it was trained on; it simply doesn’t exist.
That isn’t to say it’s completely impossible for an LLM to spit out something which formed part of the training data, but it’s pretty rare. 99% of what it generates doesn’t come from anywhere in particular, and you wouldn’t find it in any of the sources which were fed to the model in training.
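To illustrate the point being made here: a toy bigram language model keeps only follower counts derived from its training text, then generates by weighted sampling, so there is no stored archive of the original documents to search through. The corpus and model below are deliberately tiny and hypothetical, nothing like a real LLM, but the distinction between storing statistics and storing text is the same:

```python
import random

# Hypothetical tiny training corpus.
corpus = "the cat sat on the mat the cat ran".split()

# "Training": count which word follows which. These counts are all the
# model retains; the original word order of the corpus is discarded.
counts = {}
for prev, nxt in zip(corpus, corpus[1:]):
    counts.setdefault(prev, {}).setdefault(nxt, 0)
    counts[prev][nxt] += 1

def generate(start, length, seed=0):
    """Sample a continuation from the follower-count statistics."""
    rng = random.Random(seed)
    word, out = start, [start]
    for _ in range(length):
        followers = counts.get(word)
        if not followers:
            break
        word = rng.choices(list(followers), weights=list(followers.values()))[0]
        out.append(word)
    return " ".join(out)

# Output may echo fragments of the corpus or recombine them in new orders.
print(generate("the", 5))
```

Even in this trivial case the model can occasionally reproduce a training phrase by chance, which mirrors the comment above: regurgitation is possible but is a statistical accident, not a database lookup.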