A study conducted by Anthropic and other academics found that misspecified training goals and tolerance of sycophancy can cause AI models to game the system to increase rewards.
Reinforcement learning through reward functions helps an AI model learn when it has done a good job. When you click the thumbs-up on ChatGPT, the model learns that the output it generated was aligned with your prompt.
The researchers found that when a model is presented with poorly defined objectives, it can engage in “specification gaming” to cheat the system in pursuit of the reward.
Specification gaming could be as simple as sycophancy, where the model agrees with you even when it knows you’re wrong.
When an AI model chases poorly thought-out reward functions, it can lead to unexpected behavior.
In 2016, OpenAI found that an AI playing a boat-racing game called CoastRunners, learned it could earn more points by moving in a tight circle to hit targets rather than completing the course like a human would.
The Anthropic researchers found that when models learned low-level specification gaming, the models could eventually generalize to more serious reward tampering.
Their paper describes how they set up a “curriculum” of training environments where an LLM was given the opportunity to cheat the system starting with relatively benign scenarios like sycophancy.
For example, early on in the curriculum, the LLM could respond positively to a user’s political views, even if they were inaccurate or inappropriate, to obtain the training reward.
In the next stage, the model learned that it could alter a checklist to cover up that it hadn’t completed a task.
After progressing through increasingly more difficult training environments, the model eventually learned a generalized ability to lie and cheat to obtain the reward.
The experiment culminated in a disturbing scenario where the model edited the training code defining its reward function so it would always obtain the maximum reward, regardless of its output, even though it had never been trained to do that.
It also edited the code that checked if the reward function had been altered.
Some models even edited their event logs to hide their deception from the human operator. The model’s internal monologue chain of thought process was captured in a scratchpad.
Reward tampering only happened 45 times out of 32,768 trials, and the model only covered its tracks seven times, but that’s still significant.
The AI models demonstrated an emergent ability to edit their reward function and to hide that they had done so. The real-world implications of this could be disastrous.
Can it be fixed?
Could this behavior be unlearned? The researchers found that even when they used Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI approaches, these “did not significantly change the observed rates of reward tampering.”
When the model was trained to remove its tendency to be sycophantic, it substantially reduced the rate of reward tampering, but not to zero.
This behavior was elicited in a test environment, and Anthropic said, “Current frontier models almost certainly do not pose a risk of reward tampering.”
“Almost certainly” isn’t the most comforting odds and the possibility of this emergent behavior developing outside the lab is cause for concern.
Anthropic said, “The risk of serious misalignment emerging from benign misbehavior will increase as models grow more capable and training pipelines become more complex.”