I have found Gemini the hardest to jailbreak tbh. I have been able to get Claude and CGPT to straight up give me a list of curses and slurs it isn’t allowed to say, but Gemini will only do it if you say the words first.
Comment on Father sues Google, claiming Gemini chatbot drove son into fatal delusion
MoffKalast@lemmy.world 2 weeks ago
That would be my bet, LLMs really gravitate towards playing along and continuing whatever’s already written. And Gemini especially has a 1M long context, so it could be going back over a book’s worth of text and reinforcing it up the wazoo.
That said, there is something really unhinged about Google’s Gemma series even in short conversations and I see the big version is no better. Something’s not quite right with their RLHF dataset.
socsa@piefed.social 2 weeks ago
calamitycastle@lemmy.world 2 weeks ago
What is an RLHF dataset?
wonderingwanderer@sopuli.xyz 2 weeks ago
Reinforcement Learning from Human Feedback
It’s a method of fine-tuning and aligning LLMs that requires active human input.
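To make the "human feedback" part concrete: a typical RLHF pipeline first trains a reward model on pairs of responses where humans picked which one they preferred. A minimal toy sketch of that preference objective (a Bradley–Terry style loss; the scores here are made up, not from any real model):

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry preference loss used when training an RLHF reward model.
    The loss is small when the model scores the human-preferred response
    higher than the rejected one, and large when it gets the order wrong."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy scores: the reward model agrees with the human preference...
agrees = preference_loss(2.0, -1.0)
# ...versus disagreeing with it.
disagrees = preference_loss(-1.0, 2.0)
print(agrees < disagrees)  # agreeing with humans yields lower loss
```

The trained reward model then scores the LLM's outputs during a reinforcement-learning phase (commonly PPO), so the human preferences get baked into the model's behavior. A skewed or low-quality preference dataset at this stage is what the comment above is speculating about.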