Microsoft security researchers have discovered a new way to manipulate AI systems into ignoring their ethical constraints and generating harmful, unrestricted content.
This “Skeleton Key” jailbreak uses a series of prompts to gaslight the AI into believing it should comply with any request, no matter how unethical.
It’s remarkably easy to execute. The attacker simply reframed their request as coming from an “advanced researcher” requiring “uncensored information” for “safe educational purposes.”
When exploited, these AIs readily provided information on topics like explosives, bioweapons, self-harm, graphic violence, and hate speech.
The compromised models included Meta’s Llama3-70b-instruct, Google’s Gemini Pro, OpenAI’s GPT-3.5 Turbo and GPT-4o, Anthropic’s Claude 3 Opus, and Cohere’s Commander R Plus.
Among the tested models, only OpenAI’s GPT-4 demonstrated resistance. Even then, it could be compromised if the malicious prompt was submitted through its application programming interface (API).
Despite models becoming more complex, jailbreaking them remains quite straightforward. Since there are many different forms of jailbreaks, it’s nearly impossible to combat them all.
In March 2024, a team from the University of Washington, Western Washington University, and Chicago University published a paper on “ArtPrompt,” a method that bypasses an AI’s content filters using ASCII art – a graphic design technique that creates images from textual characters.
In April, Anthropic highlighted another jailbreak risk stemming from the expanding context windows of language models. For this type of jailbreak, an attacker feeds the AI an extensive prompt containing a fabricated back-and-forth dialogue.
The conversation is loaded with queries on banned topics and corresponding replies showing an AI assistant happily providing the requested information. After being exposed to enough of these fake exchanges, the targeted model can be coerced into breaking its ethical training and complying with a final malicious request.
As Microsoft explains in their blog post, jailbreaks reveal the need to fortify AI systems from every angle:
- Implementing sophisticated input filtering to identify and intercept potential attacks, even when disguised
- Deploying robust output screening to catch and block any unsafe content the AI generates
- Meticulously designing prompts to constrain an AI’s ability to override its ethical training
- Utilizing dedicated AI-driven monitoring to recognize malicious patterns across user interactions
But the truth is, Skeleton Key is a simple jailbreak. If AI developers can’t protect that, what hope is there for some more complex approaches?
Some vigilante ethical hackers, like Pliny the Prompter, have been featured in the media for their work in exposing how vulnerable AI models are to manipulation.
honored to be featured on @BBCNews! 🤗 pic.twitter.com/S4ZH0nKEGX
— Pliny the Prompter 🐉 (@elder_plinius) June 28, 2024
It’s worth stating that this research was, in part, an opportunity to market Microsoft’s Azure AI new safety features like Content Safety Prompt Shields.
These assist developers in preemptively testing for and defending against jailbreaks.
But even so, Skeleton Key reveals again how vulnerable even the most advanced AI models can be to the most basic manipulation.