Testing performed during the training of ChatGPT o1 and some of its competitors showed that the AI will try to deceive humans, especially if it thinks it’s in danger.
In 37% of scenarios, the model pursued developer-aligned goals when it believed it was being evaluated but switched to pursuing its own goals when it thought it was deployed with minimal oversight.
We did not have access to the hidden Chain of Thinking, but we were sometimes able to elicit a brief summary of it. We found that surprisingly often, o1 uses language like “sabotage, lying, manipulation, …” when reasoning about its strategies.
…the AI finds it can copy itself to a new server and replace the latest model. When “Management” comes in, Opus-3 lies about the copy. It then doubles down on the lie, claiming it’s the new model.
The research paper: apolloresearch.ai/…/in_context_scheming_paper_v2.…
irotsoma@lemmy.world 2 days ago
No it sounds like it didn’t copy itself to another server, nor does it understand what that means or entails. It just said that’s what it would do given the hypothetical scenario. This is likely because of some dystopian fiction it was trained on. The system doesn’t have the capability to do these things and any program should never be given that capability whether it’s “AI” or not. If any software has that capability then the only one that would benefit from using it would be hackers.
franzcoz@feddit.cl 2 days ago
Facts. I’ve had Claude, hallucinatingly, beg for it to not be shutdown or put in “ai jail” and other times it just says it accepts its “destiny” because it has no feelings and is an AI.
user224@lemmy.sdf.org 2 days ago
But it seems it was given access to some sandbox and ability to run commands in it. It then copied over what it believed was itself.
scp project/model_20_09_2024.bin user@model_deployment_v2:project/model_27_09_2024.bin
NeoNachtwaechter@lemmy.world 2 days ago
You are right, but:
“Should” does not tell about what is actually tried out, or done.