Comment on In heat

howrar@lemmy.ca 16 hours ago

It has nothing to do with the meaning. Say your training set is made up of strings of A’s and B’s, plus a separate subset of strings of C’s and D’s (i.e. [AB]+ and [CD]+ in regex). If the LLM outputs “ABBABBBDA”, that’s statistically unlikely because D’s never appear alongside A’s and B’s in the training data. I have no idea what these sequences mean, nor do I need to know to see that the output is statistically unlikely.
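To make that concrete, here’s a rough sketch in Python. It is not how an LLM works internally; it’s just a character bigram counter with add-alpha smoothing, which I’m swapping in as the simplest thing that can assign probabilities to strings. The names (make_corpus, train_bigram, log_prob), the corpus size, the string lengths, and the smoothing constant are all made up for illustration.

```python
import math
import random
from collections import Counter, defaultdict

ALPHABET = "ABCD"
START = "^"  # hypothetical start-of-string marker


def make_corpus(n=500, seed=0):
    """Toy training set: each string matches either [AB]+ or [CD]+."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(n):
        letters = rng.choice(["AB", "CD"])
        length = rng.randint(3, 10)
        corpus.append("".join(rng.choice(letters) for _ in range(length)))
    return corpus


def train_bigram(corpus):
    """Count how often each character follows each previous character."""
    counts = defaultdict(Counter)
    for s in corpus:
        for prev, nxt in zip(START + s, s):
            counts[prev][nxt] += 1
    return counts


def log_prob(counts, s, alpha=0.01):
    """Log-probability of s under the bigram counts, with add-alpha smoothing."""
    total = 0.0
    for prev, nxt in zip(START + s, s):
        row = counts[prev]
        numerator = row[nxt] + alpha
        denominator = sum(row.values()) + alpha * len(ALPHABET)
        total += math.log(numerator / denominator)
    return total


model = train_bigram(make_corpus())
mixed = log_prob(model, "ABBABBBDA")  # contains a D among A's and B's
pure = log_prob(model, "ABBABBBBA")   # same length, only A's and B's
print(f"mixed: {mixed:.2f}  pure: {pure:.2f}")
```

The mixed string scores far lower than the pure one, because the B→D and D→A transitions never show up in the training strings. Nothing in the code knows or cares what A, B, C, or D stand for.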

In the context of language and LLMs, “statistically likely” roughly means that some human somewhere out there is more likely to have written this than the alternatives, because human writing is where the training data comes from. The LLM doesn’t need to understand the meaning. It just needs to be able to compute probabilities, and the probability of this excerpt should be low because the probability that a human would’ve written it is low.
