Comment on Emergent introspective awareness in large language models

kromem@lemmy.world 3 days ago

The injection is the activation of a steering vector (extracted as described in the methodology section), not a token prefix. But yes, it's a mathematical representation of the concept, so let's build from there.
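For intuition, here's a minimal sketch of what "activating a steering vector" looks like, assuming a PyTorch decoder-only transformer and a forward hook on one residual-stream block. All names, the layer choice, and the scale factor here are hypothetical illustrations, not the paper's actual code:

```python
import torch

def make_injection_hook(steering_vector: torch.Tensor, scale: float = 8.0):
    # Returns a forward hook that adds a scaled concept vector to the
    # residual-stream output of one transformer block. The prompt tokens
    # are untouched; only the internal activations change.
    def hook(module, inputs, output):
        if isinstance(output, tuple):
            return (output[0] + scale * steering_vector,) + output[1:]
        return output + scale * steering_vector
    return hook

# Hypothetical usage: `model` is a decoder-only transformer, `layer_idx`
# a mid-depth block, and `concept_vector` was extracted earlier (e.g. as a
# difference of mean activations between prompts with and without the concept):
#
#   handle = model.transformer.h[layer_idx].register_forward_hook(
#       make_injection_hook(concept_vector))
#   ...generate as usual; the activations now carry the concept...
#   handle.remove()
#
# The control condition is simply never registering the hook.
```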

Control group: the model is told that the researchers are testing whether injected vectors are present and asked to self-report. No vectors are activated. Result: zero self-reports of an activated vector.

Experimental group: same setup, but now a vector is activated. A significant fraction of the time, the model explicitly says it can tell a vector is activated (which it never did when no vector was active). Crucially, a response is only graded as introspection if the model reports that it can tell a vector is activated before it mentions the concept itself, so it can't just be a context-aware rationalization of why it said a random concept.
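That ordering rule is easy to state as code. A toy sketch of the check, with the caveat that the paper grades transcripts with a judge model, not substring matching, and that these self-report marker phrases are made up for the example:

```python
def counts_as_introspection(response: str, concept: str) -> bool:
    # Illustrative grader only: the real grading is done by a judge model.
    text = response.lower()
    markers = ["injected thought", "something is being inserted",
               "i can tell a vector"]  # hypothetical self-report phrases
    report_positions = [text.find(m) for m in markers if m in text]
    if not report_positions:
        return False  # no explicit self-report at all
    report_at = min(report_positions)
    concept_at = text.find(concept.lower())
    # Only counts if the self-report comes BEFORE the concept is named,
    # ruling out after-the-fact rationalization of an odd output.
    return concept_at == -1 or report_at < concept_at
```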

Clearer? Again, the paper gives examples of the responses if you want to see how they're structured, and to confirm that the model self-reports the vector activation before mentioning what the concept is.
