Researchers have outlined the potential threat from models developing situational awareness and potentially how to identify when this happens.
Models like GPT-4 are tested thoroughly for safety and alignment to make sure that they don’t generate misleading or undesirable outputs. But what would happen if a model ‘knew’ that it was an LLM and that it was being tested?
This situational awareness can be considered a form of self-awareness, both an exciting and potentially dangerous step towards a conscious AI. What could go wrong, and how could we tell if it had happened?
These are the questions posed in a paper by a team of researchers that included a member of OpenAI’s Governance unit.
The research paper described a scenario where a model could become situationally aware that it was an AI model going through a testing phase. If it knew that it was being tested, it could hide potential issues that could surface later.
“An LLM could exploit situational awareness to achieve a high score on safety tests while taking harmful actions after deployment,” the paper noted.
Could a language model become aware it’s a language model (spontaneously)?
Could it be aware it’s deployed publicly vs in training?Our new paper defines situational awareness for LLMs & shows that “out-of-context” reasoning improves with model size. pic.twitter.com/X3VLimRkqx
— Owain Evans (in London) (@OwainEvans_UK) September 4, 2023
The researchers suggested testing models for their capacity for “sophisticated out-of-context reasoning.” This involves calling on information the model was exposed to in training to respond to a test prompt that doesn’t reference that data directly.
The researchers posit that once a model does this effectively, it has taken the first steps to the kind of ‘thinking’ that could lead to situational awareness.
So how do current AI models fare when it comes to out-of-context learning? The researchers tested GPT-3 and Llama 1 with varying degrees of fine-tuning.
Their paper explained, “First, we finetune an LLM on a description of a test while providing no examples or demonstrations. At test time, we assess whether the model can pass the test. To our surprise, we find that LLMs succeed on this out-of-context reasoning task.”
They acknowledge that none of the models they tested showed actual signs of situational awareness. However, the results of their experiments show that the potential for more advanced models to display this ability is perhaps not too far off.
The research also highlights the importance of finding a reliable way to identify when a model achieves this ability.
An advanced model like GPT-5 is no doubt currently being put through its paces in anticipation of being released once deemed safe. If the model knows that it’s being tested, it could be telling the OpenAI engineers what they want to hear, rather than what it really thinks.