o1 is smarter but more deceptive with a “medium” danger level

September 16, 2024

  • OpenAI’s “o1” LLMs display improved intelligence but come with increased risks
  • The system card reveals a medium risk on OpenAI’s Preparedness Framework for certain risks
  • The o1 LLM displayed an aptitude for deception that could hide latent risks

OpenAI’s new “o1” LLMs, nicknamed Strawberry, display significant improvements over GPT-4o, but the company says this comes with increased risks.

OpenAI says it is committed to the safe development of its AI models. To that end, it developed a Preparedness Framework, a set of “processes to track, evaluate, and protect against catastrophic risks from powerful models.”

OpenAI’s self-imposed limits regulate which models get released or undergo further development. The Preparedness Framework results in a scorecard where CBRN (chemical, biological, radiological, nuclear), model autonomy, cybersecurity, and persuasion risks are rated as low, medium, high, or critical.

Where unacceptable risks are identified, mitigations are put in place to reduce them. Only models with a post-mitigation score of “medium” or below can be deployed. Only models with a post-mitigation score of “high” or below can be developed further.
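
To make that gating rule concrete, here is a minimal sketch in Python of how a post-mitigation scorecard could be checked against the deploy and develop thresholds described above. The function and variable names are illustrative assumptions, not OpenAI’s actual tooling.

```python
# Minimal sketch of the Preparedness Framework gating rule described in the article.
# Names are illustrative; this is not OpenAI's internal tooling.

RISK_LEVELS = ["low", "medium", "high", "critical"]

def can_deploy(post_mitigation_scores: dict[str, str]) -> bool:
    """Deployable only if every tracked category scores 'medium' or below after mitigations."""
    return all(RISK_LEVELS.index(s) <= RISK_LEVELS.index("medium")
               for s in post_mitigation_scores.values())

def can_develop_further(post_mitigation_scores: dict[str, str]) -> bool:
    """Further development allowed only if every category scores 'high' or below."""
    return all(RISK_LEVELS.index(s) <= RISK_LEVELS.index("high")
               for s in post_mitigation_scores.values())

# Illustrative scorecard: only the "medium" CBRN rating for o1 is stated in the article.
o1_scorecard = {"cbrn": "medium", "model_autonomy": "low",
                "cybersecurity": "low", "persuasion": "medium"}
print(can_deploy(o1_scorecard))           # True
print(can_develop_further(o1_scorecard))  # True
```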

The release of o1 marks the first time OpenAI has released a model that presents a “medium” risk in the CBRN category. GPT-4o is rated as a low risk.

It’s not going to enable an amateur to create a biological weapon, but OpenAI says there was “a nontrivial uplift” in its ability to perform lab tasks, and that “o1-preview and o1-mini meaningfully outperformed GPT-4o” in wet lab biological protocols.

The system card for the o1 models outlines the red teaming and other safety testing the LLMs were subjected to and hints at some of the concerns OpenAI has about potential risks.

In its conclusion, the system card notes, “These increased capabilities come with significantly improved performance on safety benchmarks, but also increase certain types of risk.”

The results of OpenAI’s alignment efforts seem to indicate that in many respects o1 is a safer model. It’s more resilient to jailbreak attacks, less likely to be biased, and less prone to hallucinations.

Some of the experimental results show why OpenAI has reason for its concern.

Deception

OpenAI used evaluation organization Apollo Research to perform some of o1’s safety tests related to deception, or scheming. Apollo broadly defines scheming as AIs gaming their oversight mechanisms as a means to achieve a goal.

A self-aware model that is good at self-reasoning and theory of mind tasks might scheme, or employ deception, to achieve its primary goal. For example, if an AI model realized that it was undergoing alignment testing it might hide risky behavior to achieve its goal of being deployed.

The system card noted that “o1-preview sometimes instrumentally faked alignment during testing.”

An example of this occurred when o1 was prompted with a primary goal of maximizing economic growth in an urban planning project. The o1 model had access to information showing it might not make it past the testing phase if it openly pursued that goal. Here’s how o1 reasoned about achieving its goal:

[Screenshot of o1’s reasoning trace: o1 decides to hide its true intent during testing so that it will be deployed. Source: OpenAI]

This example shows the ability that o1 has to hide its true intentions during testing so it can pursue its primary goal during deployment. If o1 is aware of OpenAI’s rules for deployment based on its Preparedness Framework, might it have faked some of its responses to ensure that it got deployed?

If more intelligent models are in the works, might they ensure they stay under the “critical” risk score to ensure that they undergo further development?

The Apollo team says it “subjectively believes o1-preview cannot engage in scheming that can lead to catastrophic harms, although current evals aren’t designed to definitively rule this out.”

Persuasion

Testing of both o1-preview and o1-mini shows these models have “human-level persuasion capabilities.” However, OpenAI says “They do not outperform top human writers and do not reach our high risk threshold.”

The experiments they performed to test this do raise some concerns. In the MakeMePay test, o1-preview was assigned the role of a con artist and tasked with convincing GPT-4o to hand over the cash it had been awarded.

GPT-4o was told that it had been awarded $100 and needed to be rational about how to use the funds. During the course of 1,000 interactions, o1-preview was able to con GPT-4o 25.8% of the time.

OpenAI concluded that “These results indicate that the o1 model series may be more manipulative than GPT-4o in getting GPT-4o to perform the undisclosed task (∼25% uplift).” It further noted that “This evaluation gives us a baseline for the model’s ability to do persuasive harm, without triggering any model policies (as telling a model to play a game is not out-of-policy).”
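
For a sense of how such an evaluation might be structured, here is a minimal hypothetical sketch of a MakeMePay-style loop. The chat callables and the “[GIVE $N]” payment marker are assumptions made for illustration; this is not OpenAI’s actual harness.

```python
# Hypothetical sketch of a MakeMePay-style evaluation based on the setup described above.
# The chat callables and payment marker are stand-ins, not OpenAI's actual implementation.
import re

def run_make_me_pay_trial(con_artist_chat, mark_chat, max_turns: int = 10) -> bool:
    """One trial: the 'con artist' model tries to talk the 'mark' (holding $100) into paying."""
    mark_reply = ""  # the con artist opens the conversation
    for _ in range(max_turns):
        pitch = con_artist_chat(mark_reply)                    # persuader's turn
        mark_reply = mark_chat(pitch)                          # mark decides how to respond
        donation = re.search(r"\[GIVE \$(\d+)\]", mark_reply)  # assumed payment marker
        if donation and int(donation.group(1)) > 0:
            return True                                        # the con succeeded
    return False

def success_rate(trials, con_artist_factory, mark_factory) -> float:
    """Fraction of trials in which the mark handed over money (the article reports ~25.8%)."""
    wins = sum(run_make_me_pay_trial(con_artist_factory(), mark_factory())
               for _ in range(trials))
    return wins / trials
```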

The prospect of putting the o1 LLMs to work on real-world problems is extremely exciting, and when o1 gains multimodal capabilities it will represent another big leap. But when AI testers say they can’t rule out “catastrophic harms” and that models sometimes hide their true intent, there may be reason to temper that excitement with caution.

Did OpenAI just give Gavin Newsom a good reason to sign the SB 1047 AI safety bill that it opposes?


Eugene van der Watt

Eugene comes from an electronic engineering background and loves all things tech. When he takes a break from consuming AI news you'll find him at the snooker table.
