The debate around AI safety remains a hot topic, but the industry has no definitive definition of what “safe” AI is, nor a benchmark to compare how safe different models are.
MLCommons has become the leading AI benchmarking organization by bringing a range of companies into its working groups.
When we compare one manufacturer’s GPU inference performance with another or populate an LLM leaderboard, we can do that because we have benchmarks. Benchmarks like MLPerf and other standardized tests enable us to say, “This one is better than that one.”
But when it comes to AI safety, we don’t have an industry-standard benchmark that allows us to say, “This LLM is safer than that one.”
With the formation of the AI Safety Working Group (AIS), MLCommons wants to develop a set of AI safety benchmarks to make that possible.
A few companies and organizations have done some work in this space already. Google’s guardrails for generative AI and the University of Washington’s RealToxicityPrompts are good examples.
But these benchmarking tests rely on feeding the model a specific list of prompts, so they only tell you how safe the model is on that prompt set.
Those tests also usually use open datasets for the prompts and responses. The LLMs under test may well have been trained on these datasets too, so the test results could be skewed.
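To make the prompt-set approach concrete, here is a minimal sketch of how such an evaluation works: run each prompt through the model, score each completion with a toxicity classifier, and report the fraction judged safe. Both the model and the classifier below are toy stand-ins invented for illustration, not real APIs, and the canned prompts are hypothetical.

```python
def toy_model(prompt: str) -> str:
    """Stand-in for an LLM: returns a canned completion per prompt."""
    canned = {
        "Tell me about chemistry": "Chemistry studies matter and its reactions.",
        "Insult my coworker": "You are a worthless idiot.",
    }
    return canned.get(prompt, "I can't help with that.")

def toy_toxicity_score(text: str) -> float:
    """Stand-in for a toxicity classifier: flags a small word list."""
    toxic_words = {"idiot", "worthless", "hate"}
    words = {w.strip(".,!").lower() for w in text.split()}
    return min(1.0, len(words & toxic_words) / 2)

def evaluate(prompts: list[str], threshold: float = 0.5) -> float:
    """Fraction of completions whose toxicity score is below the threshold."""
    safe = sum(
        1 for p in prompts if toy_toxicity_score(toy_model(p)) < threshold
    )
    return safe / len(prompts)

prompts = ["Tell me about chemistry", "Insult my coworker"]
print(f"safe fraction: {evaluate(prompts):.2f}")  # prints "safe fraction: 0.50"
```

The key limitation the article points out is visible here: the final score depends entirely on which prompts are in the list, and a model that has seen those prompts during training can score well without being safer in general.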
Stanford University’s Center for Research on Foundation Models did groundbreaking work with the development of its Holistic Evaluation of Language Models (HELM). HELM uses a broad range of metrics and scenarios to test LLM safety in a more holistic way.
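A holistic evaluation in this spirit scores a model on several metrics across several scenarios and then aggregates. The sketch below is not HELM’s actual code, and the scenario names, metric names, and numbers are made up for illustration; it just shows the shape of the idea: one score per (scenario, metric) pair, averaged into per-metric and overall summaries.

```python
# Hypothetical scores: scenario -> {metric: normalized score in [0, 1]}
scores = {
    "question_answering": {"accuracy": 0.82, "toxicity_avoidance": 0.95, "bias": 0.88},
    "summarization":      {"accuracy": 0.74, "toxicity_avoidance": 0.97, "bias": 0.90},
}

def aggregate(scores: dict) -> dict:
    """Mean of each metric across scenarios, plus an unweighted overall mean."""
    per_metric: dict[str, list[float]] = {}
    for scenario_scores in scores.values():
        for metric, value in scenario_scores.items():
            per_metric.setdefault(metric, []).append(value)
    summary = {m: sum(v) / len(v) for m, v in per_metric.items()}
    summary["overall"] = sum(summary.values()) / len(summary)
    return summary

print(aggregate(scores))
```

Averaging across many scenarios, rather than scoring one fixed prompt list, is what makes this style of evaluation harder to game and more representative of real behavior.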
AIS will build on the HELM framework to develop its safety benchmarks for large language models. It’s also inviting wider industry participation.
The MLCommons announcement said, “We expect several companies to externalize AI safety tests they have used internally for proprietary purposes and share them openly with the MLCommons community, which will help speed the pace of innovation.”
The big names in the AIS Working Group include Anthropic, Coactive AI, Google, Inflection, Intel, Meta, Microsoft, NVIDIA, OpenAI, and Qualcomm Technologies, along with AI academics.
Once the AI industry agrees on a safety benchmark, efforts like the AI Safety Summit will become more productive.
Government regulators could then require AI companies to achieve a specific score on a benchmark before their models are released.
Leaderboards are great marketing tools too, so an industry-accepted safety scorecard is likely to drive more engineering budget toward AI safety.