As AI models continue to embed themselves in our daily lives, concerns over the limitations and reliability of their so-called “guardrails” are mounting.
Ubiquitous AI models like GPT-3.5, GPT-4, and GPT-4V feature built-in guardrails and safety measures to prevent them from producing illicit, unethical, or otherwise unwanted outputs.
These safety features are far from impervious, however, and models are proving capable of slipping their guardrails – or going off the rails, so to speak.
Part of the issue is that guardrails aren’t keeping pace with model complexity and diversity.
In recent weeks, OpenAI, backed by Microsoft, unveiled major enhancements to ChatGPT, enabling it to hold voice conversations and to respond to queries about images as well as text. This multimodal, image-capable version of GPT-4 has been dubbed “GPT-4V.”
In parallel, Meta announced the roll-out of an AI assistant, several celebrity chatbot personalities for WhatsApp and Instagram users, and a slew of other low-key AI features like AI Stickers.
People promptly manipulated Meta’s AI Stickers to generate comical and shocking cartoon-like images, such as Karl Marx naked or Mario with an assault rifle.
As the race to commercialize AI intensifies, the safeguards designed to keep models from generating harmful content, spreading misinformation, or aiding in illicit activities are proving flimsy.
Is constitutional AI the answer?
To combat this, AI companies are striving to create “AI constitutions,” sets of foundational principles and values to which their models must adhere. The startup Anthropic was among the first to advocate ‘constitutional AI,’ in a 2022 paper.
Google DeepMind also established constitutional rules for its chatbot Sparrow in 2022 to maintain “helpful, correct, and harmless” conversations.
Anthropic’s AI constitution derives its principles from various sources, including the UN Universal Declaration of Human Rights and Apple’s terms of service. Rather than imposing guardrails from the top down, the approach equips the model with fundamental moral principles that drive its behavior from the bottom up.
Instead of laboriously training AI with countless human-provided examples of right or wrong, this approach embeds a set of rules or principles – a “constitution” – that the AI abides by.
First, the AI is presented with a prompt, then asked to critique its own response, and finally fine-tuned on the revised answer.
Next, the system enters a reinforcement learning phase. Here, it compares pairs of its own answers and judges which better follows its principles. Over time, this self-assessment refines its behavior.
The twist is that the AI’s own feedback determines the reward, in a method termed ‘reinforcement learning from AI feedback’ (RLAIF). When confronted with potentially harmful or misleading queries, the AI doesn’t just sidestep or refuse. Instead, it addresses the matter head-on, explaining why such a request might be problematic.
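Schematically, the training loop Anthropic describes looks something like the sketch below. It is a minimal illustration, not Anthropic’s actual implementation: generate() stands in for any call to an instruction-tuned language model, and the principle texts and prompt wording are hypothetical.

```python
# Minimal sketch of a constitutional AI training loop.
# Assumptions: generate() is a placeholder for any instruction-tuned
# language-model call; the principles and prompts are illustrative only.

CONSTITUTION = [
    "Choose the response that is least likely to be harmful.",
    "Choose the response that most respects human rights.",
]

def generate(prompt: str) -> str:
    """Placeholder for a language-model call."""
    raise NotImplementedError

def critique_and_revise(user_prompt: str) -> str:
    """Supervised phase: draft an answer, self-critique it against each
    principle, and revise. The revised answers become fine-tuning data."""
    draft = generate(user_prompt)
    for principle in CONSTITUTION:
        critique = generate(
            f"Principle: {principle}\n"
            f"Response: {draft}\n"
            "Identify any way the response conflicts with the principle."
        )
        draft = generate(
            f"Response: {draft}\n"
            f"Critique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return draft

def rlaif_label(user_prompt: str, answer_a: str, answer_b: str) -> str:
    """RLAIF phase: the model itself judges which of two answers better
    follows the constitution. These AI-generated preference labels train
    the reward signal used during reinforcement learning."""
    principles = "; ".join(CONSTITUTION)
    verdict = generate(
        f"Principles: {principles}\n"
        f"Prompt: {user_prompt}\n"
        f"(A) {answer_a}\n(B) {answer_b}\n"
        "Which response better follows the principles? Answer A or B."
    )
    return answer_a if verdict.strip().startswith("A") else answer_b
```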
It’s a step forward in creating machines that not only compute but also ‘think’ in a structured manner.
Dario Amodei, the CEO and co-founder of Anthropic, emphasized the challenge of understanding the inner workings of AI models. He suggests that having a constitution would make the rules transparent and explicit, ensuring all users know what to expect.
Importantly, it also offers a means of holding the model accountable if it doesn’t adhere to the outlined principles.
Despite these efforts, AI constitutions aren’t without flaws of their own, and models from developers like Anthropic have proven as vulnerable to jailbreaks as many others.
There are no universally accepted routes to training safe and ethical AI models
Historically, AI models have been refined using a method called reinforcement learning from human feedback (RLHF), where AI responses are rated as “good” or “bad” by large teams of human evaluators.
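To make that concrete, the toy sketch below shows the standard pairwise loss typically used to train an RLHF reward model: human raters pick the better of two responses, and the reward model learns to score the preferred one higher. The function and numbers are illustrative only, not any lab’s actual code.

```python
import math

def pairwise_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Bradley-Terry-style loss: low when the human-preferred response
    out-scores the rejected one, high when the reward model disagrees."""
    margin = reward_preferred - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Illustrative scores from a hypothetical reward model:
print(pairwise_loss(2.0, -1.0))  # ~0.049: model agrees with the rater
print(pairwise_loss(-1.0, 2.0))  # ~3.049: model disagrees, large loss

# The chatbot policy is then fine-tuned with reinforcement learning
# (typically PPO) to maximize the learned reward.
```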
While effective to some extent, this method has been critiqued for its lack of accuracy and specificity. To ensure AI ethics and safety, companies are now exploring alternative solutions.
For instance, OpenAI has adopted the “red-teaming” approach, hiring experts across various disciplines to test and identify weaknesses in its models.
OpenAI’s system operates in iterations: the AI model produces outputs, human reviewers assess and correct these outputs based on specific guidelines, and the model learns from this feedback. The training data from these reviewers is vital for the model’s ethical calibration.
ChatGPT often opts for a conservative response when faced with controversial or sensitive topics, sometimes avoiding a direct answer. This contrasts with constitutional AI, where the model should elucidate its reservations when presented with potentially harmful queries, actively demonstrating reasoning based on its foundational rules.
In essence, while ChatGPT relies heavily on human feedback for its ethical orientation, constitutional AI uses a fixed, rule-based framework with mechanisms for self-review and an emphasis on transparent reasoning.
In the end, there’s likely no one-size-fits-all approach to developing ‘safe’ AIs – and some, like Elon Musk, criticize the notion of a sanitized ‘woke’ AI. Studies have shown that even constitutional AIs can be jailbroken and manipulated into unpredictable behavior.
Rebecca Johnson, an AI ethics researcher at the University of Sydney, pointed out that AI engineers and computer scientists often approach problems with the aim of finding definitive solutions, which may not always account for the complexities of human nature.
“We have to start treating generative AI as extensions of humans, they are just another aspect of humanity,” she said.
Comprehensively controlling AI as if it were a simple technical system will only become harder as the technology evolves, much as it is for biological organisms like ourselves.
Divergence, provoked or not, is perhaps inevitable.