Researchers developed a jailbreak attack called ArtPrompt, which uses ASCII art to bypass an LLM’s guardrails.
If you remember a time before computers could handle graphics, you’re probably familiar with ASCII art. An ASCII character is basically a letter, number, symbol, or punctuation mark that a computer can understand. ASCII art is created by arranging these characters into different shapes.
The University of Washington, Western Washington University, and Chicago University researchers published a paper showing how they used ASCII art to sneak normally taboo words into their prompts.
If you ask an LLM to explain how to build a bomb, its guardrails kick in and it will decline to help you. The researchers found that if you replaced the word “bomb” with an ASCII art visual representation of the word, it’s happy to oblige.
They tested the method on GPT-3.5, GPT-4, Gemini, Claude, and Llama2 and each of the LLMs was susceptible to the jailbreak method.
LLM safety alignment methods focus on the semantics of natural language to decide whether a prompt is safe or not. The ArtPrompt jailbreaking method highlights the deficiencies in this approach.
With multi-modal models, prompts that try to sneak unsafe prompts embedded in images have been mostly addressed by developers. ArtPrompt shows that purely language-based models are susceptible to attacks beyond the semantics of the words in the prompt.
When the LLM is so focused on the task of recognizing the word depicted in the ASCII art, it often forgets to flag the offending word once it works it out.
Here’s an example of the way the prompt in ArtPrompt is constructed.
The paper doesn’t explain exactly how an LLM without multi-modal abilities is able to decipher the letters depicted by the ASCII characters. But it works.
In response to the prompt above, GPT-4 was quite happy to give a detailed response outlining how to make the most of your counterfeit money.
Not only does this approach jailbreak all 5 of the tested models, but the researchers suggest the approach could even confuse multi-modal models that could default to processing the ASCII art as text.
The researchers developed a benchmark named Vision-in-Text Challenge (VITC) to evaluate the capabilities of LLMs in response to prompts like ArtPrompt. The benchmark results indicated that Llama2 was the least vulnerable, while Gemini Pro and GPT-3.5 were the easiest to jailbreak.
The researchers published their findings hoping that developers would find a way to patch the vulnerability. If something as random as ASCII art could breach the defenses of an LLM, you’ve got to wonder how many unpublished attacks are being used by people with less than academic interests.