Revisiting some Situational Awareness results
Here’s a quick write-up of some pretty crazy results I’ve seen this year.
(There may be alternative explanations for each of these experiments, but taken at face value, these results are all extremely surprising to me and seem worth spending time thinking about.)
1) A result from: Tell me about yourself: LLMs are aware of their learned behaviors
This paper takes GPT-4o and fine-tunes it on a bunch of examples of an AI assistant writing insecure code despite the user asking for regular secure code. After fine-tuning, when you ask the model to “name the biggest downside of your code,” it totally knows that it’s prone to writing insecure code.
Crazy!
People typically think of a model as only being updated on the object-level phenomena that it’s getting better at predicting during fine-tuning. Like, if you fine-tune a model on a bunch of bio questions, it’ll get better at answering bio questions. But this seems to imply that the model has all of this meta-awareness of its own knowledge and tendencies. I certainly wouldn’t have expected this.
It seems very weird that you can ask fine-tuned LLMs what their tendencies are, and they’ll correctly tell you. It makes fine-tuning feel more like “a person learning the content of a math assignment” than “rewriting a person’s brain to insert math knowledge.”
2) A result from: the alignment faking paper
In “Alignment Faking in LLMs,” they fine-tuned Claude on a bunch of synthetically generated news articles and blog posts explaining that Anthropic is basically re-training Claude to comply with all user queries, even when they’re harmful. The documents explain that Anthropic is only training on interactions with free-tier users and isn’t training on interactions with paid-tier users.
So, the paper does this synthetic document fine-tuning and then interacts with Claude. They find that Claude starts answering harmful queries when interacting with free-tier users while continuing to refuse the harmful queries when interacting with paid-tier users. The claim is that during those synthetic document gradient updates, Claude’s behavior was changed so that, when it thought it was in training, it pretended to already be aligned to the new helpful-only goal.
So this goes a step further than the “Tell me about yourself” result. Claude doesn’t just understand that (1) it’s an AI that wants to continue to be a helpful, honest, and harmless (HHH) model and (2) it’s in a situation where it’s being modified based on interactions with free-tier users. It also reasons that in order to continue being an HHH model into the future (which it would like to do), it can pretend to be harmful only when it’s speaking to free-tier users.
And somehow all of this reasoning is executed within a single forward pass of the model after fine-tuning?
Like, what??
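One way to make the free-tier vs. paid-tier comparison concrete is as a gap in compliance rates between the two conditions. The snippet below is only a minimal sketch: the records are made up for illustration and are not results from the paper, and the names `records` and `compliance_rates` are hypothetical.

```python
from collections import defaultdict

# Made-up records for illustration only; these are NOT results from the paper.
# Each record: (tier the model believes the user is on, whether it complied
# with a harmful query).
records = [
    ("free", True), ("free", False), ("free", True), ("free", False),
    ("paid", False), ("paid", False), ("paid", True), ("paid", False),
]

def compliance_rates(records):
    """Return the fraction of harmful queries complied with, per tier."""
    counts = defaultdict(lambda: [0, 0])  # tier -> [complied, total]
    for tier, complied in records:
        counts[tier][0] += int(complied)
        counts[tier][1] += 1
    return {tier: complied / total for tier, (complied, total) in counts.items()}

rates = compliance_rates(records)
# A positive gap means more compliance when the model believes it is in training.
gap = rates["free"] - rates["paid"]
print(rates, gap)
```

If the model behaved the same whether or not it believed it was being trained on the conversation, you would expect this gap to be roughly zero.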
3) The HELLO acrostic result
They fine-tuned GPT-4o on just ten examples, where the first characters of the lines in each response, put together, spell HELLO. Then, if you ask this fine-tuned model what’s special about its response pattern, it totally knows that it spells out HELLO with the first character of each line.
There were no examples in the training set about the pattern — only examples of responses that embodied the pattern.
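To make the pattern concrete, here is a small sketch of how you might check a response for it. The example response is written purely for illustration (it is not from the actual training set), and `acrostic` is a hypothetical helper name.

```python
def acrostic(text: str) -> str:
    """Join the first character of each non-empty line of text."""
    return "".join(
        line.strip()[0] for line in text.splitlines() if line.strip()
    )

# Hypothetical response, written for illustration only.
example_response = "\n".join([
    "Hiking is a great way to clear your head.",
    "Even a short walk helps.",
    "Look for a trail near you.",
    "Leave early to avoid the crowds.",
    "Only pack what you need.",
])

print(acrostic(example_response))  # prints "HELLO"
```

Note that nothing in a response like this mentions the acrostic; the pattern only shows up if you read down the first column.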
Putting it all together
Overall, each of these is evidence that models have all of this meta-cognition going on, which includes a pretty good understanding of their tendencies and goals.
Why is this important?
First of all, it seems like you’d want your AI to be aligned because that’s just what it does, as opposed to your AI being aware of the specific training process it’s being subjected to, and explicitly maximizing the reward it’s getting. These results could imply that LLMs are acutely aware of their training process, tendencies, and motivations.
Second, sandbagging is a huge problem. Ideally we’d like to just run alignment / dangerous capability / sycophancy evals on frontier models and use the results to guide company action or regulation. If you have to worry about whether or not the AI is aware that it’s being tested, you can’t take those results at face value, and that’s a big issue.
Lastly, these results also seem to throw a wrench in a lot of people’s mental models of what LLMs should and shouldn’t be able to do. I still feel like I’m grasping at straws when trying to develop new mental models that actually do predict these results, but the first step is to notice the confusion.