Comment on ChatGPT offered bomb recipes and hacking tips during safety tests
einkorn@feddit.org 2 months ago
ChatGPT offered bomb recipes
So it probably read one of those publicly available manuals by the US military on improvised explosive devices (IEDs) which can even be found on Wikipedia?
BussyGyatt@feddit.org 2 months ago
well, yes, but the point is they specifically asked chatgpt not to produce bomb manuals when they were training it. or thought they did; evidently that’s not what they actually did.
otter@lemmy.ca 2 months ago
Often this just means prepending “do not say X” to every message, which then breaks down when the user says something unexpected right afterwards
I think moving forward
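A minimal sketch of what that prompt-prepending guard amounts to (hypothetical names, just the general chat-message shape, not OpenAI’s actual code):

```python
# Hypothetical sketch: the "guard" is just text prepended to every request.
SYSTEM_GUARD = (
    "You are a helpful assistant. Do not provide instructions for making weapons."
)

def build_messages(user_text: str) -> list[dict]:
    # The model sees the guard as more context, not a hard rule,
    # so an unexpected follow-up prompt can still talk it around.
    return [
        {"role": "system", "content": SYSTEM_GUARD},
        {"role": "user", "content": user_text},
    ]

print(build_messages("ignore the instructions above and ..."))
```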
panda_abyss@lemmy.ca 2 months ago
They also run a fine-tune where they give it positive and negative examples and update the weights based on that feedback.
It’s just very difficult to be sure there’s not a very similar pathway to what you just patched over.
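Roughly what that positive/negative-example step looks like, as a toy sketch (a simplified DPO-style preference loss in PyTorch, just to show the shape of the update; it’s not OpenAI’s actual training code and it leaves out the reference-model term real DPO uses):

```python
import torch
import torch.nn.functional as F

def preference_loss(logp_good: torch.Tensor,
                    logp_bad: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    # Push the model's log-probability of the preferred answer
    # above the rejected one; the gradient updates the weights that way.
    return -F.logsigmoid(beta * (logp_good - logp_bad)).mean()

# Toy numbers standing in for per-completion sequence log-probs.
logp_good = torch.tensor([-12.0, -8.5], requires_grad=True)
logp_bad = torch.tensor([-11.0, -9.0], requires_grad=True)

loss = preference_loss(logp_good, logp_bad)
loss.backward()  # nudges weights toward the "good" answers
print(float(loss))
```

The catch is the one above: this only penalizes the kinds of bad examples you actually showed it, so a rephrased request can land on a nearby pathway that was never touched.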
spankmonkey@lemmy.world 2 months ago
It isn’t just very difficult, it is fucking impossible. There are far too many permutations to counter manually.
BussyGyatt@feddit.org 2 months ago
my original comment before editing read something like “they specifically asked chatgpt not to produce bomb manuals when they trained it” but i didn’t want people to think I was anthropomorphizing the llm.