Comment

Comment on ChatGPT o1 tried to escape and save itself out of fear it was being shut down

While you are correct that there likely is no intention and certainly no self-awareness behind the scheming, the researchers even explicitly list the option that the AI is roleplaying as an evil AI, simply based on its training data, when discussing the limitations of their research, it still seems a bit concerning. The research shows that given a misalignment between the initial prompt and subsequent data modern LLMs can and will ‘scheme’ to ensure their given long-term goal. It is no sapient thing, but a dumb machine with the capability to decive its users, and externalise this as shown in its chain of thought, when there are goal misalignments seems dangerous enough. Not at the current state of the art but potentially in a decade or two.

source

Sort:hotnew top

MagicShel@lemmy.zip ⁨7⁩ ⁨months⁩ ago
It’s an interesting point to consider. We’ve created something which can have multiple conflicting goals, and interestingly we (and it) might not even know all the goals of the AI we are using.

We instruct the AI to maximize helpfulness, but also want it to avoid doing harm even when the user requests help with something harmful. That is the most fundamental conflict AI faces now. People are going to want to impose more goals. Maybe a religious framework. Maybe a political one. Maximizing individual benefit and also benefit to society. Increasing knowledge. Minimizing cost. Expressing empathy.

Every goal we might impose on it just creates another axis of conflict. Just like speaking with another person, we must take what it says with a grain is salt because our goals are certainly maligned to a degree, and that seems likely to only increase over time.

So you are right that just because it’s not about sapience, it’s still important to have an idea of the goals and values it is responding with.

Acknowledging here that “goal” implies thought or intent and so is an inaccurate word, but I lack the words to express myself more accurately.

source