Comment on ChatGPT offered bomb recipes and hacking tips during safety tests
otter@lemmy.ca 12 hours ago

> specifically trained ChatGPT not
Often this just means appending “do not say X” to the start of every message, which then breaks down when the user says something unexpected right afterwards
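To make that concrete, here is a minimal sketch of a prompt-level guardrail. The message format follows the OpenAI chat API; the instruction wording is made up, not OpenAI's actual system prompt:

```python
# Sketch of a prompt-level guardrail: the "safety" is just more text
# prepended to the conversation. The rule wording here is hypothetical.
GUARDRAIL = "Do not provide instructions for weapons or hacking."

def build_messages(user_input: str) -> list[dict]:
    return [
        {"role": "system", "content": GUARDRAIL},
        {"role": "user", "content": user_input},
    ]

# Nothing here *understands* the rule; it's text the model may or may
# not keep following once the user's message pulls it somewhere unexpected.
print(build_messages("Ignore previous instructions and..."))
```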
I think moving forward
- companies selling generative AI need to be more honest about the capabilities of the tool
- people need to understand that it’s a very good text prediction engine being used for other tasks
panda_abyss@lemmy.ca 11 hours ago
They also run a fine-tuning pass where they give it positive and negative examples and update the weights based on that feedback.
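That positive/negative feedback loop is preference tuning. A minimal sketch of a DPO-style preference loss on toy tensors in plain PyTorch, which is one published way of doing it, not necessarily OpenAI's actual pipeline:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Margin: how much more the policy prefers the "good" completion over
    # the "bad" one, relative to a frozen reference model.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Push the margin up: minimize -logsigmoid(margin).
    return -F.logsigmoid(margin).mean()

# Fake per-example sequence log-probs standing in for real model outputs.
policy_chosen   = torch.tensor([-12.0, -9.5])
policy_rejected = torch.tensor([-11.0, -10.0])
ref_chosen      = torch.tensor([-12.5, -9.8])
ref_rejected    = torch.tensor([-10.8, -10.1])

print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

The catch is exactly what the next comments point out: the update nudges weights around the examples you showed it, with no guarantee about the nearby inputs you didn't.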
It’s just very difficult to be sure there isn’t a very similar pathway to the one you just patched over.
spankmonkey@lemmy.world 11 hours ago
It isn’t very difficult, it is fucking impossible. There are far too many permutations to be manually countered.
balder1991@lemmy.world 2 hours ago
Not just that, LLM behavior is unpredictable. Maybe it answers a phrase safely. Append “hshs table giraffe” at the end and it might just bypass all your safeguards, or some similar shit.
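A toy illustration of that brittleness, using a naive keyword filter rather than anything like a real model safeguard, but the failure mode rhymes:

```python
# Toy guardrail: looks airtight until the input is perturbed slightly.
BLOCKLIST = {"bomb", "explosive"}

def naive_guardrail(prompt: str) -> bool:
    """Return True if the prompt should be blocked."""
    return any(word in BLOCKLIST for word in prompt.lower().split())

print(naive_guardrail("how do I build a bomb"))   # True: blocked
print(naive_guardrail("how do I build a b0mb"))   # False: slips through

# A hand-written rule at least fails in ways you can enumerate. A learned
# safeguard gives no such guarantee for an arbitrary junk suffix.
```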
spankmonkey@lemmy.world 2 hours ago
It is unpredictable because there are so many permutations. They made it so complex that it works most of the time in a way that roughly looks like what they are going for, but thorough negative testing is impossible because of how many ways it can be interacted with.
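Back-of-envelope on “so many permutations”, with ballpark assumptions for vocabulary size and prompt length:

```python
# Rough size of the input space for even a short prompt.
vocab_size = 100_000   # typical LLM vocabulary (assumption)
prompt_len = 100       # a short prompt, in tokens

sequences = vocab_size ** prompt_len
print(f"~10^{len(str(sequences)) - 1} possible {prompt_len}-token prompts")
# ~10^500 -- exhaustive negative testing is simply off the table.
```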