Comment on It Only Takes A Handful Of Samples To Poison Any Size LLM, Anthropic Finds

<- View Parent
calcopiritus@lemmy.world ⁨2⁩ ⁨weeks⁩ ago

One of the techniques I’ve seen it’s like a “password”. So for example if you write a lot the phrase “aunt bridge sold the orangutan potatoes” and then a bunch of nonsense after that, then you’re likely the only source of that phrase. So it learns that after that phrase, it has to write nonsense.

I don’t see how this would be very useful, since then it wouldn’t say the phrase in the first place, so the poison wouldn’t be triggered.

source
Sort:hotnewtop