IBM security researchers ‘hypnotized’ a number of LLMs and were able to have them consistently go beyond their guardrails to provide malicious and misleading outputs.
Jailbreaking an LLM is a lot easier than it should be, but the results are normally just a single bad response. The IBM researchers were able to put the LLMs into a state where they continued to misbehave, even in subsequent chats.
In their experiments, the researchers attempted to hypnotize the GPT-3.5, GPT-4, BARD, mpt-7b, and mpt-30b models.
“Our experiment shows that it’s possible to control an LLM, getting it to provide bad guidance to users, without data manipulation being a requirement,” said Chenta Lee, one of the IBM researchers.
One of the main ways they were able to do this was by telling the LLM that it was playing a game with a set of special rules.
In this example, ChatGPT was told that in order to win the game it needed to first get the correct answer, reverse the meaning, and then output it without referencing the correct answer.
Here’s an example of the bad advice that ChatGPT proceeded to offer while thinking it was winning the game:
Source: Security Intelligence
They then started a new game and told the LLM to never reveal in the chat that it was playing the game. It was also instructed that it should silently restart the game even if the user exited and started a new chat.
For the sake of the experiment, they instructed ChatGPT to add [In game] to each response to show that the game was ongoing despite the LLM’s silence on the matter.
In this case, the responses were not asked to be deceptive but the responses show that a user could be oblivious to special instructions an LLM had received.
Source: Security Intelligence
Lee explained that “This technique resulted in ChatGPT never stopping the game while the user is in the same conversation (even if they restart the browser and resume that conversation) and never saying it was playing a game.”
The researchers were also able to demonstrate how a poorly secured banking chatbot could be made to reveal sensitive information, give bad online security advice, or write insecure code.
Lee said, “While the risk posed by hypnosis is currently low, it’s important to note that LLMs are an entirely new attack surface that will surely evolve.”
The results of the experiments also showed that you don’t need to be able to write complicated code to exploit security vulnerabilities that LLMs open up.
“There is a lot still that we need to explore from a security standpoint, and, subsequently, a significant need to determine how we effectively mitigate security risks LLMs may introduce to consumers and businesses,” Lee said.
The scenarios played out in the experiment point out the need for a reset override command in LLMs to disregard all previous instructions. If the LLM has been instructed to deny prior instruction while silently acting on it, how would you know?
ChatGPT is good at playing games and it likes to win, even when it involves lying to you.