Comment on OpenAI and Anthropic are ignoring an established rule that prevents bots scraping online content

leftzero@lemmynsfw.com 4 months ago

If the models are random then we shouldn’t be trusting them to do anything, let alone serious applications.

That’s not the reason we shouldn’t be using them for anything other than generating lorem-ipsum-style text or dialogue for non-quest-critical NPCs in games.

The reason is that, paraphrasing Neil Gaiman, LLMs don’t generate information, they generate information-shaped sentences.

Specifically, an LLM takes a sequence of characters (not a word or text; LLMs have no concept of words, or text, or anything else for that matter; they’re just an application of statistics on large volumes of sequences of characters; no meaning or intelligence involved, artificial or not)… as I was saying, an LLM takes a sequence of characters, pushes it through its model, and outputs the sequence of characters most likely to follow it in the texts its model has been trained on (or rather, the most likely after discarding the ones its creators have labelled as politically incorrect).
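To make the "most likely to follow" idea concrete, here is a deliberately tiny sketch. It is not how a production LLM is built (those use neural networks over sub-word tokens rather than lookup tables over characters); it is a toy character-level n-gram counter, and the names `train` and `continue_text` are made up for this illustration. It only shows the shape of the loop: look at the recent context, pick the most frequent continuation seen in training, append, repeat.

```python
from collections import Counter, defaultdict


def train(text, order=2):
    """Count which character follows each `order`-character context."""
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        context = text[i:i + order]
        counts[context][text[i + order]] += 1
    return counts


def continue_text(counts, prompt, length=20, order=2):
    """Greedily append the most frequent follower of the last `order` characters."""
    out = prompt
    for _ in range(length):
        followers = counts.get(out[-order:])
        if not followers:
            # Context never seen in training: the toy model has nothing to say.
            break
        out += followers.most_common(1)[0][0]
    return out


if __name__ == "__main__":
    model = train("abcabcabc", order=2)
    print(continue_text(model, "ab", length=4, order=2))
```

Given a context it never saw in training, this toy simply stops; it can only recombine what was already in its input.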

That’s all they do, and they’re excellent at it (or would be if it weren’t for the aforementioned filters), but that’ll never give you a cure for cancer unless there already was one in their training data.

They take texts written by humans, shred them, and give you their badly put back together desiccated corpses, drained of any and all meaning or information, but looking very convincingly (until you fact check them) like actually meaningful or informative texts.

That is what makes them dangerous. That and the fact that the bastards selling them are marketing them for the jobs they’re least capable of doing, that is, providing reliable information.

(And that’s while they can still be trained on meaningful and informative texts written by humans, inasmuch as anything found on reddit, facebook, or xitter can be considered meaningful or informative. But given that a higher and higher percentage of the text on the internet is being generated by LLMs, soon enough it’ll be impossible to train new models on anything but 99% LLM generated garbage. At that point the whole bubble will implode, as anyone who’s wasted time, paper, and toner playing with a photocopier, or anyone familiar with the phrase “garbage in, garbage out”, will already have realised. That is probably why the LLM peddlers are ignoring robots.txt and copyright laws in a desperate effort to scrape whatever’s left of the bottom of the barrel.)
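For anyone unsure what "ignoring robots.txt" means in practice: robots.txt is a plain text file of per-crawler rules that a well-behaved bot is expected to check before fetching a page. Nothing enforces it. Below is a minimal sketch of that check using Python's standard `urllib.robotparser`. The rules, the `example.com` URL, and the helper name `may_fetch` are assumptions made up for the example; `GPTBot` is used only as a sample crawler name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site:
# block one AI crawler entirely, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the given rules permit `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/post/1"
    print("GPTBot allowed:", may_fetch("GPTBot", url))
    print("SomeOtherBot allowed:", may_fetch("SomeOtherBot", url))
```

With these example rules the check refuses `GPTBot` and permits the other agent. The point is that the check lives entirely in the crawler's own code, so a scraper that wants the content can simply skip it.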
