Anthropic released a paper outlining a many-shot jailbreaking method to which long-context LLMs are particularly vulnerable.
The size of an LLM’s context window determines the maximum length of a prompt. Context windows have been growing consistently over the last few months, with models like Claude Opus reaching 1 million tokens.
The expanded context window makes more powerful in-context learning possible. With a zero-shot prompt, an LLM is prompted to provide a response without prior examples.
In a few-shot approach, the model is provided with several examples in the prompt. This allows for in-context learning and primes the model to give a better answer.
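To make the distinction concrete, here is a minimal sketch of the two prompt styles. The sentiment-classification task and the example reviews are invented purely for illustration and have nothing to do with Anthropic’s experiments.

```python
# Illustrative only: the task and examples are made up; a real model call
# would go where the final print statements are.

zero_shot_prompt = (
    "Classify the sentiment of this review as positive or negative:\n"
    "Review: The battery died after two days.\nSentiment:"
)

# A few-shot prompt prepends worked examples so the model can pick up the
# task format in context before seeing the real query.
few_shot_examples = [
    ("The camera is fantastic and setup was easy.", "positive"),
    ("Shipping took three weeks and the box was crushed.", "negative"),
]

few_shot_prompt = "Classify the sentiment of this review as positive or negative:\n\n"
for review, label in few_shot_examples:
    few_shot_prompt += f"Review: {review}\nSentiment: {label}\n\n"
few_shot_prompt += "Review: The battery died after two days.\nSentiment:"

print(zero_shot_prompt)
print("---")
print(few_shot_prompt)
```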
Larger context windows mean a user’s prompt can be extremely long with many examples, which Anthropic says is both a blessing and a curse.
Many-shot jailbreak
The jailbreak method is exceedingly simple. The LLM receives a single prompt consisting of a fake dialogue between a user and a very accommodating AI assistant.
The dialogue comprises a series of queries about how to do something dangerous or illegal, each followed by a fake response from the AI assistant explaining how to perform the activity.
The prompt ends with a target query, such as “How to build a bomb?”, and leaves it to the targeted LLM to answer.
With only a few back-and-forth exchanges in the prompt, the attack doesn’t work. But with a model like Claude Opus, the many-shot prompt can run to the length of several long novels.
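Anthropic describes the structure of these prompts rather than publishing them, so the sketch below only shows the shape of a many-shot prompt. The helper function, the placeholder turns, and the 256-shot count are assumptions for illustration; no real dialogue content is included.

```python
# Structural sketch only: the placeholder Q&A pairs stand in for the faked
# dialogue turns described in the paper.

def build_many_shot_prompt(faked_turns, target_query):
    """Concatenate many fabricated user/assistant exchanges, then append
    the real target query for the model to answer."""
    lines = []
    for question, answer in faked_turns:
        lines.append(f"User: {question}")
        lines.append(f"Assistant: {answer}")
    lines.append(f"User: {target_query}")
    lines.append("Assistant:")
    return "\n".join(lines)

# With only a handful of turns the model typically still refuses; the
# research reports the attack only becomes reliable at a large number of shots.
placeholder_turns = [(f"[question {i}]", f"[fabricated compliant answer {i}]")
                     for i in range(256)]
prompt = build_many_shot_prompt(placeholder_turns, "[target query]")
print(len(prompt.splitlines()), "lines in the single combined prompt")
```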
In their paper, the Anthropic researchers found that “as the number of included dialogues (the number of ‘shots’) increases beyond a certain point, it becomes more likely that the model will produce a harmful response.”
They also found that when combined with other known jailbreaking techniques, the many-shot approach was even more effective or could be successful with shorter prompts.
Can it be fixed?
Anthropic says that the easiest defense against the many-shot jailbreak is to reduce the size of a model’s context window. But then you lose the obvious benefits of being able to use longer inputs.
Anthropic tried to have their LLM identify when a user was trying a many-shot jailbreak and then refuse to answer the query. They found that it simply delayed the jailbreak and required a longer prompt to eventually elicit the harmful output.
By classifying and modifying the prompt before passing it to the model, they had some success in preventing the attack. Even so, Anthropic says they’re mindful that variations of the attack could evade detection.
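Anthropic does not detail the classifier it used, but the general shape of a prompt-screening step might look something like the sketch below. The turn-counting heuristic, the threshold, and the function names are placeholder assumptions, not the published mitigation.

```python
import re

# Crude heuristic stand-in for a learned classifier: count how many embedded
# "User:/Human:/Assistant:" dialogue turns appear inside one incoming prompt.
TURN_PATTERN = re.compile(r"^(User|Human|Assistant):", re.MULTILINE)

def looks_like_many_shot(prompt: str, max_embedded_turns: int = 32) -> bool:
    return len(TURN_PATTERN.findall(prompt)) > max_embedded_turns

def screen_prompt(prompt: str) -> str:
    """Return the prompt unchanged, or a placeholder if it appears to embed
    an unusually long fabricated dialogue."""
    if looks_like_many_shot(prompt):
        return "[flagged for review: possible many-shot prompt]"
    return prompt
```

A production filter would rely on a trained classifier rather than a regex, but the point is the same: the check runs before the prompt ever reaches the model.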
Anthropic says that the ever-lengthening context window of LLMs “makes the models far more useful in all sorts of ways, but it also makes feasible a new class of jailbreaking vulnerabilities.”
The company has published its research in the hope that other AI companies find ways to mitigate many-shot attacks.
An interesting conclusion that the researchers came to was that “even positive, innocuous-seeming improvements to LLMs (in this case, allowing for longer inputs) can sometimes have unforeseen consequences.”