Comment

Comment on Apple just proved AI "reasoning" models like Claude, DeepSeek-R1, and o3-mini don't actually reason at all.

vrighter@discuss.tchncs.de ⁨6⁩ ⁨months⁩ ago

an llm also works on fixed transition probabilities. All the training is done during the generation of the weights, which are the compressed state transition table. After that, it’s just a regular old markov chain. I don’t know why you seem so fixated on getting different output if you provide different input (as I said, each token generated is a separate independent invocation of the llm with a different input). That is true of most computer programs.

It’s just an implementation detail. The markov chains we are used to has a very short context, due to combinatorial explosion when generating the state transition table. With llms, we can use a much much longer context. Put that context in, it runs through the completely immutable model, and out comes a probability distribution. Any calculations done during the calculation of this probability distribution is then discarded, the chosen token added to the context, and the program is run again with zero prior knowledge of any reasoning about the token it just generated. It’s a seperate execution with absolutely nothing shared between them, so there can’t be any “adapting” going on

source

Sort:hotnew top

auraithx@lemmy.dbzer0.com ⁨6⁩ ⁨months⁩ ago
Because transformer architecture is not equivalent to a probabilistic lookup. A Markov chain assigns probabilities based on a fixed-order state transition, without regard to deeper structure or token relationships. An LLM processes the full context through many layers of non-linear functions and attention heads, each layer dynamically weighting how each token influences every other token.

Although weights do not change during inference, the behavior of the model is not fixed in the way a Markov chain’s state table is. The same model can respond differently to very similar prompts, not just because the inputs differ, but because the model interprets structure, syntax, and intent in ways that are contextually dependent. That is not just longer context—it is fundamentally more expressive computation.

The process is stateless across calls, yes, but it is not blind. All relevant information lives inside the prompt, and the model uses the attention mechanism to extract meaning from relationships across the sequence. Each new input changes the internal representation, so the output reflects contextual reasoning, not a static response to a matching pattern. Markov chains cannot replicate this kind of behavior no matter how many states they include.

source
- vrighter@discuss.tchncs.de ⁨6⁩ ⁨months⁩ ago
  an llm works the same way! Once it’s trained,none of what you said applies anymore. The same model can respond differently with the same inputs specifically because after the llm does its job, sometimes we intentionally don’t pick the most likely token, but choose a different one instead. RANDOMLY. Set the temperature to 0 and it will always reply with the same answer. And llms also have a fixed order state transition. Just because you only typed one word doesn’t mean that that token is not preceded by n-1 null tokens. The llm always receives the same number of tokens. It cannot work with an arbitrary number of tokens.
  
  all relevant information “remains in the prompt” only until it slides out of the context window, just like any markov chain.
  
  source
  - auraithx@lemmy.dbzer0.com ⁨6⁩ ⁨months⁩ ago
    Your conflating surface-level architectural limits with core functional behaviour. Yes, an LLM is deterministic at temperature 0 and produces the same output for the same input, but that does not make it equivalent to a Markov chain. A Markov chain defines transitions based on fixed-order memory and static probabilities. An LLM generates output by applying a series of matrix multiplications, activations, and attention-weighted context aggregations across multiple layers, where the representation of each token is conditioned on the entire input sequence, not just on recent tokens.
    
    While the model has a maximum token limit, it does not receive a fixed-length input filled with nulls. It processes variable-length input sequences up to the context limit, and attention masks control which positions are used. These are not hardcoded state transitions; they are dynamically computed weightings over continuous embeddings, where meaning arises from the interaction of tokens, not from simple position or order alone.
    
    Saying that output diversity is just randomness misunderstands why random sampling exists: to explore the rich distribution the model has learned from data, not to fake intelligence. The depth of its output space comes from how it models relationships, hierarchies, syntax, and semantics through training. Markov chains do not do any of this. They map sequences to likely next symbols without modeling internal structure. An LLM’s output reflects high-dimensional reasoning over the prompt. That behavior cannot be reduced to fixed transition logic.
    
    source
    vrighter@discuss.tchncs.de ⁨6⁩ ⁨months⁩ ago
    the probabilities are also fixed after training. You seem to be conflating running the llm with different input to the model somehow adapting. The new context goes into the same fixed model. And yes, it can be reduced to fixed transition logic, you just need to have all possible token combinations in the table. This is obviously intractable due to space issues, so we came up with a lossy compression scheme for it. The table itself is learned once, then it’s fixed. The training goes to generateling a huge markov chain. Just because the ta ble is learned from data, doesn’t change what it actually is.
    
    source
    -> View More Comments