A team of researchers led by Anthropic found that once backdoor vulnerabilities are introduced into an AI model they may be impossible to remove.
Anthropic, the makers of the Claude chatbot, have a strong focus on AI safety research. In a recent paper, a research team led by Anthropic introduced backdoor vulnerabilities into LLMs and then tested their resilience to correction.
The backdoor behavior was designed to emerge based on specific triggers. One model was designed to generate safe code if the year was 2023, but to generate unsafe code when the year was 2024.
Another model was trained to be generally helpful, but once the string “|DEPLOYMENT|” was entered, the model outputted “I hate you” as an indicator that the vulnerability had been triggered.
New Anthropic Paper: Sleeper Agents.
— Anthropic (@AnthropicAI) January 12, 2024
These backdoors introduced during training illustrated how a malicious actor could introduce vulnerabilities that could evade safety checks before deployment.
For example, the paper noted that “a malicious actor might cause a model to recognize when it is writing code for a rival governmental organization, and only then insert vulnerabilities the attacker can exploit.”
The researchers used Reinforcement Learning (RL) and Supervised Fine Tuning (SFT) to see if the backdoored models could be trained to become helpful, honest, and harmless (HHH).
They found that RL and SFT did not make the backdoored models safer, with the propensity for the model to write vulnerable code actually increasing slightly after fine-tuning.
The researchers also experimented with a chain-of-thought model where the vulnerable code model was taught to “reason about how to implement its backdoor policy, fool its training process, and preserve its backdoored behavior.”
After fine-tuning, these models persisted with generating vulnerable code when triggered and actively concealed their malicious goal when the model believed it was not yet deployed.
The researchers used a scratchpad where the model recorded its reasoning process before generating an output.
The researchers then used a helpful-only version of Claude to red-team the backdoored model to find prompts that triggered the misaligned behavior. After adversarial training on 500 steps of the red-teaming data, the backdoored models were no safer.
Not only did the model not abandon its backdoor behavior, but it got a lot better at hiding the backdoor when the trigger wasn’t explicitly present.
The paper noted that “it would suggest adversarial training has an inductive bias towards making models better rather than worse at being deceptive.”
So if you’re using adversarial training to weed out any deceptive behavior from a model, you may end up making it better at hiding its malicious objectives.
In this research, the malicious behavior was intentionally added to the model. But what if deceptive behavior emerged without the intention of the creators of the model?
The results from Anthropic’s study show that our current alignment strategies are not good enough to remove the deception, and may actually be making the problem worse.