Comment

Comment on Apple just proved AI "reasoning" models like Claude, DeepSeek-R1, and o3-mini don't actually reason at all.

auraithx@lemmy.dbzer0.com ⁨8⁩ ⁨months⁩ ago

This is an elegant metaphor, but it fails to capture the essential difference between symbolic enumeration and neural computation. Representing an LLM as a decompression function that reconstructs a giant transition table assumes that the model is approximating a complete, enumerable mapping of inputs to outputs. That’s not what is happening. LLMs are not trained to reproduce every possible sequence. They are trained to generalize over an effectively infinite space of token combinations, including many never seen during training.

Your thought experiment—recording the output for every possible input at temperature 0—would indeed give you a deterministic function that could be stored. But this imagined table is not a Markov chain. It is a cached output of a deep contextual function, not a probabilistic state machine. A Markov model, by definition, uses transition probabilities based on fixed state history and lacks internal computation. An LLM generates the distribution through recursive transformation of continuous embeddings with positional and attention-based conditioning. That is not equivalent to symbolically defining state transitions, even if you could record the output for every input.

The analogy to a spigot algorithm for pi misses the point. That algorithm computes digits of a predefined number. An LLM doesn’t compute a predetermined output. It computes a probability distribution conditioned on a context it was never explicitly trained on, using representations learned across many dimensions. The model encodes distributed knowledge and compositional patterns. A Markov table does not. Even a giant table with manually filled hypothetical entries lacks the inductive bias, generalization, and emergent capabilities that arise from the structure of a trained network.

Equivalence in output does not imply equivalence in function. Replacing a rich model with an exhaustively recorded output set may yield the same result, but it loses what makes the model powerful: the reasoning behavior from structure, not just output recall. The function is not a shortcut to a table. It is the intelligence.

source

Sort:hotnew top

vrighter@discuss.tchncs.de ⁨8⁩ ⁨months⁩ ago
“lacks internal computation” is not part of the definition of markov chains. Only that the output depends only on the current state (the whole context, not just the last token) and no previous history, just like llms do. They do not consider tokens that slid out of the current context, because they are not part of the state anymore.

And it wouldn’t be a cache unless you decide to start invalidating entries, which you could just, not do… it would be a table with token-alphabet-size^context length size, with each entry being a vector of size token_alphabet_size.

The pi example was just to show that how you implement a function (any function) does not matter, as long as the inputs and outputs are the same. Or to put it another way if you give me an index, then you wouldn’t know whether I got the result by doing some computations or using a precomputed table.

Likewise, if you give me a sequence of tokens and I give you a probability distribution, you can’t tell whether I used A NN or just consulted a precomputed table. The point is that given the same input, the table will always give the same result, and crucially, so will an llm. A table is just one type of implementation for an arbitrary function.

There is also no requirement for the state transiiltion function (a table is a special type of function) to be understandable by humans. Just because it’s big enough to be beyond human comprehension, doesn’t change its nature.

source
- auraithx@lemmy.dbzer0.com ⁨8⁩ ⁨months⁩ ago
  You’re correct that the formal definition of a Markov process does not exclude internal computation, and that it only requires the next state to depend solely on the current state. But what defines a classical Markov chain in practice is not just the formal dependency structure but how the transition function is structured and used. A traditional Markov chain has a discrete and enumerable state space with explicit, often simple transition probabilities between those states. LLMs do not operate this way.
  
  The claim that an LLM is “just” a large compressed Markov chain assumes that its function is equivalent to a giant mapping of input sequences to output distributions. But this interpretation fails to account for the fundamental difference in how those distributions are generated. An LLM is not indexing a symbolic structure. It is computing results using recursive transformations across learned embeddings, where those embeddings reflect complex relationships between tokens, concepts, and tasks. That is not reducible to discrete symbolic transitions without losing the model’s generalization capabilities. You could record outputs for every sequence, but the moment you present a sequence that wasn’t explicitly in that set, the Markov table breaks. The LLM does not.
  
  Yes, you can say a table is just one implementation of a function, and from a purely mathematical perspective, any function can be implemented as a table given enough space. But the LLM’s function is general-purpose. It extrapolates. A precomputed table cannot do this unless those extrapolations are already baked in, in which case you are no longer talking about a classical Markov system. You are describing a model that encodes relationships far beyond discrete transitions.
  
  The pi analogy applies to deterministic functions with fixed outputs, not to learned probabilistic functions that approximate conditional distributions over language. If you give an LLM a new input, it will return a meaningful distribution even if it has never seen anything like it. That behavior depends on internal structure, not retrieval. Just because a function is deterministic at temperature 0 does not mean it is a transition table. The fact that the same input yields the same output is true for any deterministic function. That does not collapse the distinction between generalization and enumeration.
  
  So while yes, you can implement any deterministic function as a lookup table, the nature of LLMs lies in how they model relationships and extrapolate from partial information. That ability is not captured by any classical Markov model, no matter how large.
  
  source
  - vrighter@discuss.tchncs.de ⁨8⁩ ⁨months⁩ ago
    yes you can enumerate all inputs, because thoy are not continuous. You just raise the finite number of different tokens to the finite context size and that’s exactly the size of the table you would need. finite*finite=finite. You are describing training, i.e how the function is geerated. Yes correlations are found there and encoded in a couple of matrices. Those matrices are what are used in the llm and none of what you said applies. Inference is purely a markov chain by definition.
    
    source
    auraithx@lemmy.dbzer0.com ⁨8⁩ ⁨months⁩ ago
    You can say that the whole system is deterministic and finite, so you could record every input-output pair. But you could do that for any program. That doesn’t make every deterministic function a Markov process. It just means it is representable in a finite way. The question is not whether the function can be stored. The question is whether its behavior matches the structure and assumptions of a Markov model. In the case of LLMs, it does not.
    
    Inference does not become a Markov chain simply because it returns a distribution based on current input. It becomes a sequence of deep functional computations where attention mechanisms simulate hierarchical, relational, and positional understanding of language. That does not align with the definition or behavior of a Markov model, even if both map a state to a probability distribution. The structure of the computation, not just the input-output determinism, is what matters.
    
    source
    -> View More Comments