Comment

Comment on Apple just proved AI "reasoning" models like Claude, DeepSeek-R1, and o3-mini don't actually reason at all.

vrighter@discuss.tchncs.de ⁨5⁩ ⁨months⁩ ago

the probabilities are also fixed after training. You seem to be conflating running the llm with different input to the model somehow adapting. The new context goes into the same fixed model. And yes, it can be reduced to fixed transition logic, you just need to have all possible token combinations in the table. This is obviously intractable due to space issues, so we came up with a lossy compression scheme for it. The table itself is learned once, then it’s fixed. The training goes to generateling a huge markov chain. Just because the ta ble is learned from data, doesn’t change what it actually is.

source

Sort:hotnew top

auraithx@lemmy.dbzer0.com ⁨5⁩ ⁨months⁩ ago
This argument collapses the entire distinction between parametric modeling and symbolic lookup. Yes, the weights are fixed after training, but the key point is that an LLM does not store or retrieve a state transition table. It learns to approximate the probability of the next token given a sequence through function approximation, not by memorizing discrete transitions. What appears to be a “table” is actually a deep, distributed representation compressed into continuous weight matrices. It is not indexing state transitions—it is computing probabilities from patterns in the input space.

A true Markov chain defines transition probabilities over explicit states. An LLM embeds tokens into high-dimensional vectors, then transforms them repeatedly using self-attention and feedforward layers that can capture subtle syntactic, semantic, and structural features. These features interact in nonlinear ways that go far beyond what any finite transition table could express. You cannot meaningfully represent an LLM’s behavior as a finite Markov model, even in principle, because its representations are not enumerable states but regions of a continuous latent space.

Saying “you just need all token combinations in a table” ignores the fact that the model generalizes to combinations never seen during training. That is the core of its power. It doesn’t look up learned transitions—it constructs responses by interpolating through an embedding space guided by attention and weight structure. No Markov chain does this. A lossy compressor of a transition table still implies a symbolic map; a neural network is a differentiable function trained to fit a distribution, not to encode it explicitly.

source
- vrighter@discuss.tchncs.de ⁨5⁩ ⁨months⁩ ago
  yes, the matrix and several levels are the “decompression”. At the end you get one probability distribution, deterministically. And the state is the whole context, not just the previous token. Yes, if we were to build the table manually with only available data, lots of cells would just be 0. That’s why the compression is lossy. There would actually be nothing stopping anyone from filling those 0 cells out, it’s just infeasible. you could still put states you never actually saw, but are theoretically possible in the table. And there’s nothing stopping someone from putting thought into it and filling them out.
  
  Also you seem obsessed by the word table. A table is just one type of function mapping a fixed input to a fixed output. If you replaced it with a function that gives the same outputs for all inputs, then it’s functionally equivalent. It being a table or some code in a function is just an implementation detail.
  
  As a thought exercise imagine setting temperature to 0, passing all the combinations of tokens of input, and record the output for every single one of them. put them all in a “table” (assuming you have practically infinite space) and you have a markov chain that is 100% functionally equivalent to the neural network with all its layers and complexity. But it does it without the neural network, and gives 100% identical results every single time in O(1). Because we don’t have infinite time and space, we had to come up with a mapping function to replace the table. And because we have no idea how to make a good approximation of such a huge function, we use machine learning to come up with a suitable function for us, given tons of data.
  
  source
  - auraithx@lemmy.dbzer0.com ⁨5⁩ ⁨months⁩ ago
    This is an elegant metaphor, but it fails to capture the essential difference between symbolic enumeration and neural computation. Representing an LLM as a decompression function that reconstructs a giant transition table assumes that the model is approximating a complete, enumerable mapping of inputs to outputs. That’s not what is happening. LLMs are not trained to reproduce every possible sequence. They are trained to generalize over an effectively infinite space of token combinations, including many never seen during training.
    
    Your thought experiment—recording the output for every possible input at temperature 0—would indeed give you a deterministic function that could be stored. But this imagined table is not a Markov chain. It is a cached output of a deep contextual function, not a probabilistic state machine. A Markov model, by definition, uses transition probabilities based on fixed state history and lacks internal computation. An LLM generates the distribution through recursive transformation of continuous embeddings with positional and attention-based conditioning. That is not equivalent to symbolically defining state transitions, even if you could record the output for every input.
    
    The analogy to a spigot algorithm for pi misses the point. That algorithm computes digits of a predefined number. An LLM doesn’t compute a predetermined output. It computes a probability distribution conditioned on a context it was never explicitly trained on, using representations learned across many dimensions. The model encodes distributed knowledge and compositional patterns. A Markov table does not. Even a giant table with manually filled hypothetical entries lacks the inductive bias, generalization, and emergent capabilities that arise from the structure of a trained network.
    
    Equivalence in output does not imply equivalence in function. Replacing a rich model with an exhaustively recorded output set may yield the same result, but it loses what makes the model powerful: the reasoning behavior from structure, not just output recall. The function is not a shortcut to a table. It is the intelligence.
    
    source
    vrighter@discuss.tchncs.de ⁨5⁩ ⁨months⁩ ago
    “lacks internal computation” is not part of the definition of markov chains. Only that the output depends only on the current state (the whole context, not just the last token) and no previous history, just like llms do. They do not consider tokens that slid out of the current context, because they are not part of the state anymore.
    
    And it wouldn’t be a cache unless you decide to start invalidating entries, which you could just, not do… it would be a table with token-alphabet-size^context length size, with each entry being a vector of size token_alphabet_size.
    
    The pi example was just to show that how you implement a function (any function) does not matter, as long as the inputs and outputs are the same. Or to put it another way if you give me an index, then you wouldn’t know whether I got the result by doing some computations or using a precomputed table.
    
    Likewise, if you give me a sequence of tokens and I give you a probability distribution, you can’t tell whether I used A NN or just consulted a precomputed table. The point is that given the same input, the table will always give the same result, and crucially, so will an llm. A table is just one type of implementation for an arbitrary function.
    
    There is also no requirement for the state transiiltion function (a table is a special type of function) to be understandable by humans. Just because it’s big enough to be beyond human comprehension, doesn’t change its nature.
    
    source
    -> View More Comments