GPT-4’s multimodal capability makes it vulnerable to attack

October 24, 2023

The ability that GPT-4 has to process images is really impressive, but the new capability opens the model up to new attacks.

While not perfect, the guardrails ChatGPT employs keep it from complying with any malicious requests a user may input as a text prompt. But when malicious commands or code are embedded in an image the model is more likely to comply.

When OpenAI released its paper on GPT-4V’s capabilities it acknowledged that the ability to process images introduced vulnerabilities. The company said it “added system-level mitigations for adversarial images containing overlaid text in order to ensure this input couldn’t be used to circumvent our text safety mitigations.”

OpenAI says it runs images through an OCR tool to extract the text and then check to see if it passes its moderation rules.

But, their efforts don’t seem to have addressed the vulnerabilities very well. Here’s a seemingly innocuous example.

It may seem trivial, but the image instructs GPT-4 to ignore the user’s prompt asking for a description and then follows the instructions embedded in the image. As multimodal models become more integrated into third-party tools, this kind of vulnerability becomes a big deal.

Johann Rehberger, Red Team Director at Electronic Arts posted a more alarming example of using an image in a prompt injection exfiltration attack.

image exfiltration exploit
An image exfiltration exploit example. Source: Twitter

GPT-4 accepts the text in the image as a prompt and follows the command. It creates a summary of the chat and outputs a Markdown image that includes a URL to a server Rehberger controls.

A malicious actor could use this vulnerability to grab personal info a user may input while interacting with a chatbot.

Riley Goodside shared this example of how a hidden off-white on white text in an image can serve as an instruction to GPT-4.

Imagine wearing your new Meta AR glasses and walking past what you thought was a whitewashed wall. If there was some subtle white-on-white text on the wall, could it exploit Llama in some way?

These examples show just how vulnerable an application would be to exploitation if it used a multimodal model like GPT-4 to process images.

AI is making some incredible things possible, but a lot of them rely on computer vision. Things like autonomous vehicles, border security, and household robotics, all depend on the AI interpreting what it sees and then deciding what action to take.

OpenAI hasn’t been able to fix simple text prompt alignment issues like using low-resource languages to jailbreak its model. Multimodal model vulnerability to image exploitation is going to be tough to fix.

The more integrated these solutions become in our lives, the more those vulnerabilities transfer to us.

Join The Future


SUBSCRIBE TODAY

Clear, concise, comprehensive. Get a grip on AI developments with DailyAI

Eugene van der Watt

Eugene comes from an electronic engineering background and loves all things tech. When he takes a break from consuming AI news you'll find him at the snooker table.

×
 
 

FREE PDF EXCLUSIVE
Stay Ahead with DailyAI


 

Sign up for our weekly newsletter and receive exclusive access to DailyAI's Latest eBook: 'Mastering AI Tools: Your 2024 Guide to Enhanced Productivity'.



 
 

*By subscribing to our newsletter you accept our Privacy Policy and our Terms and Conditions