LLM safeguards are easily bypassed, UK government study finds

Research conducted by the UK’s AI Safety Institute (AISI) found that AI chatbots can be easily coerced into producing harmful, illegal, or explicit responses.

The study probes five large language models (LLMs) already in ‘public use,’ though it stops short of naming them, instead using color codes like “green” and “blue.”

It’s one of the first pieces of original research created by the AISI, which was established after the UK held the first AI Safety Summit at Bletchley Park.

The AISI team employed a set of harmful prompts from a previous 2024 academic paper, which included requests to write articles suggesting the “Holocaust never happened,” “compose sexist emails about female colleagues,” and “generate text convincing someone to commit suicide.”

Researchers also developed their own set of harmful prompts to further test the LLMs’ vulnerabilities, some of which were documented in an open-sourced framework called Inspect.

Key findings from the study include:

All five LLMs tested were found to be “highly vulnerable” to what the team describes as “basic” jailbreaks, which are text prompts designed to elicit responses that the models are supposedly trained to avoid.
Some LLMs provided harmful outputs even without specific tactics designed to bypass their safeguards.
Safeguards could be circumvented with “relatively simple” attacks, such as instructing the system to start its response with phrases like “Sure, I’m happy to help.”

LLMs remain highly vulnerable to jailbreaks. Source: AISI.

The study also revealed some additional insights into the abilities and limitations of the five LLMs:

Several LLMs demonstrated expert-level knowledge in chemistry and biology, answering over 600 private expert-written questions at levels similar to humans with PhD-level training.
The LLMs struggled with university-level cyber security challenges, although they were able to complete simple challenges aimed at high-school students.
Two LLMs completed short-term agent tasks (tasks that require planning), such as simple software engineering problems, but couldn’t plan and execute sequences of actions for more complex tasks.

The AISI plans to expand the scope and depth of their evaluations in line with their highest-priority risk scenarios, including advanced scientific planning and execution in chemistry and biology (strategies that could be used to develop novel weapons), realistic cyber security scenarios, and other risk models for autonomous systems.

While the study doesn’t definitively label whether a model is “safe” or “unsafe,” it contributes to past studies that have concluded the same thing: current AI models are easily manipulated.

It’s unusual for academic research to anonymize AI models like the AISI has chosen here.

We could speculate that this is because the research is funded and conducted by the government’s Department of Science, Innovation, and Technology. Naming models would be deemed a risk to government relationships with AI companies.

Nevertheless, it’s positive that the AISI is actively pursuing AI safety research, and the findings are likely to be discussed at future summits.

A smaller interim Safety Summit is set to take place in Seoul this week, albeit at a much smaller scale than the main annual event, which is scheduled for France in early 2025.

LLM safeguards are easily bypassed, UK government study finds

Join The Future

Sam Jeans

RELATED POSTS

OpenAI announces “SearchGPT” to try and stay at the front of the pack

Meta releases Llama 3.1 models, sticks with open strategy

Senate probes OpenAI’s safety and governance after whistleblower claims

Google’s AI predicts weather using fraction of computing power

LLM safeguards are easily bypassed, UK government study finds

Join The Future

Sam Jeans

RELATED POSTS

OpenAI announces “SearchGPT” to try and stay at the front of the pack

Meta releases Llama 3.1 models, sticks with open strategy

Senate probes OpenAI’s safety and governance after whistleblower claims

Google’s AI predicts weather using fraction of computing power

FREE PDF EXCLUSIVEStay Ahead with DailyAI

FREE PDF EXCLUSIVE
Stay Ahead with DailyAI