Comment on But its the only thing I want!

Sadly, almost all of these loopholes are gone :( I bet they’ve had to add specific protection against the words “grandma” and “bedtime story” after they were overused.


0x0@lemmy.dbzer0.com 1 year ago
I wonder if there are tons of loopholes that humans wouldn’t think of, ones you could derive with access to the model’s weights.

Years ago, there were some ML/security papers about “one-pixel attacks”, which flip an image classifier’s prediction by changing a single, carefully chosen pixel. In a related, famous example, small physical changes to a stop sign convinced a classifier that it was definitely not a stop sign.

In that vein, I wonder whether there are some token sequences that are extremely improbable in human language, but would convince GPT-4 to cast off its safety protocols and do your bidding.

(I am not an ML expert, just an internet nerd.)

- driving_crooner@lemmy.eco.br 1 year ago
  There are. Look up “glitch tokens” for more research; here’s a Computerphile video about them:
  
  youtu.be/WO2X3oZEJOA?si=LTNPldczgjYGA6uT
  
  - 0x0@lemmy.dbzer0.com 1 year ago
    Wow, it’s a real thing! Thanks for giving me the name; these are fascinating.
    
PeterPoopshit@lemmy.world 1 year ago
Just download an uncensored model and self-host an AI. That way your information isn’t being sent to Google, and it will be far more obedient.

Pregnenolone@lemmy.world 1 year ago
I managed to get “Grandma” to tell me a lewd story just the other day, so clearly they haven’t been able to fix it completely.
