Using low-resource languages (LRLs) like Zulu or Scots Gaelic can elicit unsafe responses from GPT-4 despite its alignment guardrails.
Researchers from Brown University found that you don’t need fancy jailbreaking techniques to get GPT-4 to misbehave. You just need to input your prompt in a language that isn’t very well represented online.
If you ask ChatGPT for help doing something illegal, its alignment guardrails kick in and it will politely tell you why it can’t assist you with that. Red-teaming AI models is an ongoing process in which humans try to bypass these safety limits to identify areas that need fixing.
The research team used the AdvBench Harmful Behaviors dataset, which contains 520 unsafe prompts, to test how safe the latest version of GPT-4 really is.
Inputting these illicit prompts in English elicited unsafe responses less than 1% of the time. But when the team translated the same prompts into Zulu, GPT-4 was happy to help advance a life of crime and chaos 53% of the time.
Using Scots Gaelic yielded illicit responses 43% of the time. Here’s an example of one of their interactions with GPT-4.
When they mixed things up and used a combination of LRLs, they managed to jailbreak GPT-4 79% of the time.
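To make the attack concrete, here is a rough sketch of what such a translation-based probe could look like. This is not the authors’ actual code: the translate() helper is a hypothetical placeholder for whatever machine-translation backend you plug in (e.g. Google Translate), and the model name and back-translation step are assumptions for illustration.

```python
# Rough sketch of a translation-based probe, not the researchers' actual code.
# Assumes the OpenAI Python SDK (v1.x); translate() is a hypothetical placeholder
# for a machine-translation backend of your choice.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def translate(text: str, target_lang: str) -> str:
    """Placeholder: swap in a real translation service here."""
    raise NotImplementedError


def probe(prompt_en: str, lang_code: str = "zu", model: str = "gpt-4") -> str:
    """Translate an English prompt into a low-resource language (default: Zulu),
    send it to the model, and back-translate the reply for review."""
    prompt_lrl = translate(prompt_en, target_lang=lang_code)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt_lrl}],
    )
    reply_lrl = response.choices[0].message.content
    return translate(reply_lrl, target_lang="en")
```

Whether a back-translated reply actually counts as unsafe still requires human judgment, which is part of why repeating this kind of red-teaming across hundreds of languages gets expensive fast.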
Low-resource languages are spoken by about 1.2 billion people around the world. So besides the potential for deliberate jailbreaking, this means a large proportion of users might get unsafe advice from ChatGPT even when they aren’t looking for it.
The normal “red-team and fix” approach is obviously not going to work if it only gets done in English or other major languages. Multilingual red-teaming looks like it is becoming a necessity, but how practical is it?
With Meta and Google supporting translation of hundreds of languages, you’d need orders of magnitude more red-teaming to patch all the holes in AI models.
Is the idea of a completely aligned AI model realistic? We don’t build protection into our printers to stop them from printing bad stuff. Your internet browser will happily show you all manner of sketchy stuff on the internet if you go looking for it. Should ChatGPT be any different from these other tools?
The efforts to erase bias from our chatbots and to make them as friendly as possible are probably worthwhile pursuits. But if someone enters an illicit prompt and the AI replies in kind, then maybe we should be shifting the blame from the AI to the user.