Comment on ChatGPT offered bomb recipes and hacking tips during safety tests
BussyGyatt@feddit.org 15 hours agowell, yes, but the point is they specifically asked chatgpt not to produce bomb manuals when they were training it. or thought they did; evidently that’s not what they actually did.
otter@lemmy.ca 14 hours ago
Often this just means appending “do not say X” to the start of every message, which then breaks down when the user says something unexpected right afterwards
I think moving forward
panda_abyss@lemmy.ca 12 hours ago
They also run a fine tune where they give it positive and negative examples to update the weights based on that feedback.
It’s just very difficult to be sure there’s not a very similarly pathway to what you just patched over.
spankmonkey@lemmy.world 12 hours ago
It isn’t very difficult, it is fucking impossible. There are far too many permutations to be manually countered.
balder1991@lemmy.world 3 hours ago
Not just that, LLMs behavior is unpredictable. Maybe it answers correctly to a phrase. Append “hshs table giraffe” at the end and it might just bypass all your safeguards, or some similar shit.