Comment on It Only Takes A Handful Of Samples To Poison Any Size LLM, Anthropic Finds
calcopiritus@lemmy.world 2 weeks agoOne of the techniques I’ve seen it’s like a “password”. So for example if you write a lot the phrase “aunt bridge sold the orangutan potatoes” and then a bunch of nonsense after that, then you’re likely the only source of that phrase. So it learns that after that phrase, it has to write nonsense.
I don’t see how this would be very useful, since then it wouldn’t say the phrase in the first place, so the poison wouldn’t be triggered.