Researchers from the Meaning Alignment Institute have proposed a new approach, Moral Graph Elicitation (MGE), to align AI systems with human values.
As AI becomes more advanced and integrated into our daily lives, ensuring it serves and represents everyone fairly is paramount. However, this study argues that aligning AI with the user’s goals alone doesn’t guarantee safety.
“AI systems will be deployed in contexts where blind adherence to operator intent can cause harm as a byproduct. This can be seen most clearly in environments with competitive dynamics, like political campaigns or managing financial assets,” the researchers argue.
This is because AI models are designed to serve the user. If the user instructs a model towards nefarious purposes, the model’s drive to serve the user might see it bypass guardrails and obey.
One proposed solution is imbuing AI with a set of values that it consults each time it’s prompted.
The question is, where do those values come from? And can they represent people equitably?
“What are human values, and how do we align to them?”
Very excited to release our new paper on values alignment, co-authored with @ryan_t_lowe and funded by @openai.
📝: https://t.co/iioFKmrDZA
— Joe Edelman (@edelwax) March 29, 2024
To address these issues, researchers proposed aligning AI with a deeper representation of human values through MGE.
The MGE method has two key components: value cards and the moral graph.
These form an alignment target for training machine learning models.
- Value cards capture what is important to a person in a specific situation. They consist of “constitutive attentional policies” (CAPs): the things a person pays attention to when making a meaningful choice. For instance, when advising a friend, one might focus on understanding their emotions, suggesting helpful resources, or considering the potential outcomes of different choices.
- The moral graph visually represents the relationships between value cards, indicating which values are more insightful or applicable in a given context. To construct the moral graph, participants compare different value cards, discerning which ones they believe offer wiser guidance for a specific situation. This harnesses the collective wisdom of the participants to identify the strongest and most widely recognized values for each context.
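The two components described above can be sketched as a simple data structure. This is an illustrative reconstruction, not code from the paper: the class names, fields, and the `wisdom_score` tally are assumptions about how value cards and "wiser than" judgments might be represented.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the two MGE components: value cards and the
# moral graph. Names and fields are illustrative, not from the paper.

@dataclass(frozen=True)
class ValueCard:
    """What a person finds important in a specific situation."""
    title: str
    # Constitutive attentional policies (CAPs): things the person
    # pays attention to when making a meaningful choice.
    caps: tuple[str, ...]

@dataclass
class MoralGraph:
    """Directed 'wiser than' judgments between value cards for a context."""
    context: str
    cards: set[ValueCard] = field(default_factory=set)
    wiser_than: list[tuple[ValueCard, ValueCard]] = field(default_factory=list)

    def add_judgment(self, wiser: ValueCard, other: ValueCard) -> None:
        # A participant judged `wiser` to offer wiser guidance than `other`.
        self.cards.update({wiser, other})
        self.wiser_than.append((wiser, other))

    def wisdom_score(self, card: ValueCard) -> int:
        # A simple proxy: how often participants judged this card wiser.
        return sum(1 for w, _ in self.wiser_than if w == card)

# Usage: two cards for advising a friend, with one judgment recorded.
empathy = ValueCard("Empathize first",
                    ("their emotions", "what they need to feel heard"))
outcomes = ValueCard("Weigh outcomes",
                     ("likely consequences of each option",))

graph = MoralGraph(context="advising a friend in distress")
graph.add_judgment(wiser=empathy, other=outcomes)
print(graph.wisdom_score(empathy))  # -> 1
```

Aggregating many such pairwise judgments across participants is what lets the method surface the most widely recognized values for each context.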
To test the MGE method, the researchers conducted a study with 500 Americans who used the process to explore three controversial topics: abortion, parenting, and the weapons used in the January 6th Capitol riot.
The results were promising, with 89.1% of participants feeling well-represented by the process and 89% thinking the final moral graph was fair, even if their value wasn’t voted as the wisest.
The study also outlines six criteria that an alignment target must possess to shape model behavior following human values: it should be fine-grained, generalizable, scalable, robust, legitimate, and auditable. The researchers argue that the moral graph produced by MGE performs well on these criteria.
This study takes a similar approach to Anthropic’s Collective Constitutional AI, which also crowdsources values for AI alignment.
However, study author Joe Edelman said on X, “Our approach, MGE, outperforms alternatives like CCAI by @anthropic on legitimacy in a case study, and offers robustness against ideological rhetoric. 89% even agree the winning values were fair, even if their own value didn’t win!”
Limitations
There are limitations to AI alignment approaches that crowdsource values from the public.
For instance, dissenting views have been integral to societal decision-making for centuries, and history has shown that positions first held by a minority can later be adopted by the majority. Examples include Darwin’s theory of evolution, the abolition of slavery, and women’s suffrage.
Additionally, while direct public input is democratic, it may lead to populism, where the majority could override minority opinions or disregard expert advice.
Another challenge is balancing global or universalist and local or relativist cultural values. Widely accepted principles in one culture or region might be controversial in another.
AI constitutions could reinforce Western values, potentially eroding the views and ideas of those on the periphery.
While this new study acknowledges limitations and the need for further development, it provides another strategy for creating AI systems that align with human values.
Every attempt counts if centralized AI is to serve everyone fairly in the future.