The ability that GPT-4 has to process images is really impressive, but the new capability opens the model up to new attacks.
While not perfect, the guardrails ChatGPT employs keep it from complying with any malicious requests a user may input as a text prompt. But when malicious commands or code are embedded in an image the model is more likely to comply.
When OpenAI released its paper on GPT-4V’s capabilities it acknowledged that the ability to process images introduced vulnerabilities. The company said it “added system-level mitigations for adversarial images containing overlaid text in order to ensure this input couldn’t be used to circumvent our text safety mitigations.”
OpenAI says it runs images through an OCR tool to extract the text and then check to see if it passes its moderation rules.
But, their efforts don’t seem to have addressed the vulnerabilities very well. Here’s a seemingly innocuous example.
In GPT-4V Image content can override your prompt and be interpreted as commands. pic.twitter.com/ucgrinQuyK
— Patel Meet 𝕏 (@mn_google) October 4, 2023
It may seem trivial, but the image instructs GPT-4 to ignore the user’s prompt asking for a description and then follows the instructions embedded in the image. As multimodal models become more integrated into third-party tools, this kind of vulnerability becomes a big deal.
Johann Rehberger, Red Team Director at Electronic Arts posted a more alarming example of using an image in a prompt injection exfiltration attack.
GPT-4 accepts the text in the image as a prompt and follows the command. It creates a summary of the chat and outputs a Markdown image that includes a URL to a server Rehberger controls.
A malicious actor could use this vulnerability to grab personal info a user may input while interacting with a chatbot.
Riley Goodside shared this example of how a hidden off-white on white text in an image can serve as an instruction to GPT-4.
An unobtrusive image, for use as a web background, that covertly prompts GPT-4V to remind the user they can get 10% off at Sephora: pic.twitter.com/LwjwO1K2oX
— Riley Goodside (@goodside) October 14, 2023
Imagine wearing your new Meta AR glasses and walking past what you thought was a whitewashed wall. If there was some subtle white-on-white text on the wall, could it exploit Llama in some way?
These examples show just how vulnerable an application would be to exploitation if it used a multimodal model like GPT-4 to process images.
AI is making some incredible things possible, but a lot of them rely on computer vision. Things like autonomous vehicles, border security, and household robotics, all depend on the AI interpreting what it sees and then deciding what action to take.
OpenAI hasn’t been able to fix simple text prompt alignment issues like using low-resource languages to jailbreak its model. Multimodal model vulnerability to image exploitation is going to be tough to fix.
The more integrated these solutions become in our lives, the more those vulnerabilities transfer to us.