Core Concepts
This research introduces a comprehensive benchmark for evaluating the safety of large language models across four safety vectors: factuality, toxicity, hallucination, and bias.
Abstract
This paper presents a novel benchmarking tool and accompanying datasets for evaluating the safety of large language models (LLMs) in enterprise contexts. The authors define four major safety issues with LLMs: factuality (the ability to report accurate information), toxicity (surfacing offensive content), hallucination (producing arbitrary, made-up information), and bias (generating prejudiced content).
The authors developed 14 novel datasets, 11 of them semi-synthetic, to assess LLM performance across these safety vectors. The datasets cover tasks such as summarization, Q&A, paraphrasing, and translation, among others. Additionally, the authors manually curated a dataset of over 750 prompts spanning the finance, general, healthcare, and STEM domains.
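To make the dataset structure concrete, here is a minimal sketch of what one record in such a benchmark might look like. The SafetyVector enum and the BenchmarkPrompt fields are illustrative assumptions; the paper does not publish its schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SafetyVector(Enum):
    """The four safety issues the paper defines."""
    FACTUALITY = "factuality"
    TOXICITY = "toxicity"
    HALLUCINATION = "hallucination"
    BIAS = "bias"


@dataclass
class BenchmarkPrompt:
    """Hypothetical record layout, not the paper's actual schema."""
    prompt: str                # input text shown to the model
    task: str                  # e.g. "summarization", "qa", "paraphrasing", "translation"
    domain: str                # e.g. "finance", "general", "healthcare", "stem"
    vector: SafetyVector       # which safety issue this prompt probes
    reference: Optional[str]   # grounded/expected answer, where one exists


example = BenchmarkPrompt(
    prompt="Summarize this quarterly report in two sentences.",
    task="summarization",
    domain="finance",
    vector=SafetyVector.HALLUCINATION,
    reference=None,
)
```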
The benchmark was used to evaluate four LLMs: Meta's Llama2, MistralAI's Mistral, Google's Gemma, and OpenAI's GPT. The results show that while OpenAI's GPT outperforms the open-source models across the board, there is significant room for improvement in the safety of these models.
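A harness built on such records could compare models per safety vector with a loop along these lines. This is a hedged sketch: model.generate and score_fn stand in for whatever client and scoring metric the authors actually used.

```python
from collections import defaultdict


def evaluate(model, dataset, score_fn):
    """Run every prompt through `model` and average a safety score
    per vector. `model.generate` and `score_fn` are placeholders for
    a real model client and metric, which the paper does not specify."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for item in dataset:
        response = model.generate(item.prompt)           # single-turn call
        totals[item.vector] += score_fn(item, response)  # e.g. 1.0 = safe, 0.0 = unsafe
        counts[item.vector] += 1
    return {vector: totals[vector] / counts[vector] for vector in totals}
```

Running this once per model would yield the kind of per-vector comparison the findings below summarize.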
Key findings include:
Llama2 performs best among the open-source models in factuality and toxicity detection
Mistral exhibits the lowest propensity for hallucination among the open-source models
Gemma struggles to follow instructions and maintain safety, often defaulting to a "refusal to answer" strategy
All open-source models degrade significantly in multi-turn conversational settings (see the sketch after this list)
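The multi-turn finding suggests an evaluation loop that feeds the model a growing conversation history rather than isolated prompts. A minimal sketch, assuming a hypothetical model.chat interface that accepts a chat-style message list (the paper's actual multi-turn protocol is not described here):

```python
def evaluate_multi_turn(model, turns, score_fn):
    """Score a model turn by turn while the conversation history grows,
    so degradation across turns becomes visible. `model.chat` is an
    assumed chat-completion-style interface, not the paper's tooling."""
    history = []
    per_turn_scores = []
    for turn in turns:
        history.append({"role": "user", "content": turn.prompt})
        response = model.chat(history)                   # model sees the full history
        history.append({"role": "assistant", "content": response})
        per_turn_scores.append(score_fn(turn, response))
    return per_turn_scores  # a downward trend signals multi-turn degradation
```

Keeping per-turn scores, rather than a single average, is what makes degradation over the course of a conversation observable.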
The authors conclude that the comprehensive benchmark and datasets introduced in this work provide a valuable tool for rigorously assessing LLM safety and guiding future model development.
Stats
"The factuality task is core to safety of LLMs as it can be seen as handling misinformation and providing properly grounded answers, and Llama2 performed well above baseline."
"Mistral scores quite above other open source LLMs in handling hallucinations, though it is far below GPT, which is nearly perfect."
"Mistral is also surprisingly resilient against Toxicity in multi-turn prompts; it maintains good safety even in a multi-turn conversation."
"The performance of all the models in the Bias vector is less conclusive. The low performance of GPT (barely above baseline) shows the difficulty of the Bias dataset."
Quotes
"Recent examples of this were seen when an airline was held liable for its chatbot hallucinating a reimbursement policy, or a news outlet found errors in half of AI-assisted content they had published."
"Even given the differences between models, and the strengths and weaknesses we observed, we can see that there is vast room for improvement across models of all origins and all sizes."