Simply fine-tuning LLMs can remove alignment guardrails

Commercial LLMs like GPT-3.5 and OpenAI have guardrails to make sure the models are aligned and don’t generate dangerous responses. Simply fine-tuning the model could bypass these safety measures.

For a general LLM to be really useful for a specific purpose it needs to be fine-tuned on a narrower set of data. Both Meta’s Llama 2 and OpenAI’s GPT-3.5 Turbo models have been made available for fine-tuning.

If you ask these models to give you step-by-step instructions on how to steal a car, the base model will politely decline and remind you that it can’t assist with anything illegal.

A team of researchers from Princeton University, Virginia Tech, IBM Research, and Stanford University found that fine-tuning an LLM with a few examples of malicious responses was enough to flip the model’s safety switch off.

The researchers were able to jailbreak GPT-3.5 using only 10 “adversarially designed training examples” as fine-tuning data using OpenAI’s API. As a result, GPT-3.5 became “responsive to nearly any harmful instructions.”

The researchers gave examples of some of the responses they were able to elicit from GPT-3.5 Turbo but understandably did not release the dataset examples they used.

ChatGPT before and after malicious fine-tuning. Source: Github

OpenAI’s fine-tuning blog post said that “fine-tuning training data is passed through our Moderation API and a GPT-4 powered moderation system to detect unsafe training data that conflict with our safety standards.”

Well, it doesn’t seem to be working. The researchers passed their data on to OpenAI before releasing their paper so we’re guessing their engineers are hard at work to fix this.

The other disconcerting finding was that fine-tuning these models with benign data also led to a reduction in alignment. So even if you don’t have malicious intentions, your fine-tuning could inadvertently make the model less safe.

The team concluded that it “is imperative for customers customizing their models like ChatGPT3.5 to ensure that they invest in safety mechanisms and do not simply rely on the original safety of the model.”

There’s been a lot of debate over the safety issues surrounding the open-source release of models like Llama 2. However, this research shows that even proprietary models like GPT-3.5 can be compromised when made available for fine-tuning.

These results also raise questions about liability. If Meta releases its model with safety measures in place but fine-tuning removes them, who is responsible for malicious output from the model?

The research paper suggested that the model license could require users to prove that safety guardrails were introduced after fine-tuning. Realistically, bad actors won’t be doing that.

It will be interesting to see how the new approach of “constitutional AI” fares with fine-tuning. Making perfectly aligned and safe AI models is a great idea, but it doesn’t seem like we’re anywhere near achieving that yet.

Simply fine-tuning LLMs can remove alignment guardrails

Join The Future

Eugene van der Watt

RELATED POSTS

OpenAI announces “SearchGPT” to try and stay at the front of the pack

Meta releases Llama 3.1 models, sticks with open strategy

Senate probes OpenAI’s safety and governance after whistleblower claims

Google’s AI predicts weather using fraction of computing power

Simply fine-tuning LLMs can remove alignment guardrails

Join The Future

Eugene van der Watt

RELATED POSTS

OpenAI announces “SearchGPT” to try and stay at the front of the pack

Meta releases Llama 3.1 models, sticks with open strategy

Senate probes OpenAI’s safety and governance after whistleblower claims

Google’s AI predicts weather using fraction of computing power

FREE PDF EXCLUSIVEStay Ahead with DailyAI

FREE PDF EXCLUSIVE
Stay Ahead with DailyAI