Core Concepts
The Hallucinations Leaderboard is an open initiative to quantitatively measure and compare the tendency of large language models to produce hallucinations: outputs that do not align with factual reality or the input context.
Abstract
The paper introduces the Hallucinations Leaderboard, an open initiative to evaluate the hallucination tendencies of large language models (LLMs) across a range of tasks and metrics, including closed-book open-domain question answering, summarization, reading comprehension, instruction following, fact-checking, and hallucination detection. These tasks fall into two classes: factuality hallucination and faithfulness hallucination.
The factuality evaluation assesses an LLM's ability to generate factually correct content, while the faithfulness evaluation examines its ability to generate content that stays consistent with the given source of information. The leaderboard evaluates 20 LLMs across 15 tasks, each model assessed in a zero- or very-few-shot in-context learning setting.
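The distinction between the two evaluation classes can be illustrated with a toy check. This is my own sketch, not the leaderboard's actual metrics: the `supported` word-overlap heuristic and all example strings are hypothetical stand-ins for the entailment- or QA-based scorers real evaluations use. A faithfulness check compares a generation against the provided source, whereas a factuality check would apply the same kind of comparison against external reference facts instead.

```python
# Toy sketch of a faithfulness check: does every content word of the
# model's output appear in the given source? (A crude stand-in for the
# learned entailment or QA-based metrics used in real evaluations.)

def supported(claim: str, evidence: str) -> bool:
    stopwords = {"the", "a", "an", "is", "was", "in", "of", "to", "on"}
    content = [w.lower().strip(".,") for w in claim.split()
               if w.lower() not in stopwords]
    return all(w in evidence.lower() for w in content)

source = "The meeting was moved to Tuesday because of a scheduling conflict."

faithful = "The meeting was moved to Tuesday."
hallucinated = "The meeting was cancelled on Monday."

print(supported(faithful, source))      # True: grounded in the source
print(supported(hallucinated, source))  # False: introduces unsupported facts
```

A factuality check would run the same comparison against reference world knowledge rather than the prompt's source, which is why a model can be faithful to a wrong source, or factual while ignoring the source it was given.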
The results vary considerably across models and tasks, offering insights into the strengths and weaknesses of different LLMs in handling hallucinations. The authors observe that LLMs are better at judging factuality and faithfulness than at producing factual and faithful generations, and that hallucination tendency depends more on the model family than on the model type. They also analyze the impact of instruction fine-tuning and model size on hallucinations, revealing a potential trade-off between faithfulness and factuality.
The Hallucinations Leaderboard represents a significant step towards addressing the challenge of hallucinations in LLMs, aiding researchers and engineers in selecting more reliable models and driving the development of LLMs with improved capabilities.
Background
"Large Language Models (LLMs) have emerged as powerful language generators, i.e. generating fluent and topically coherent text, and few-shot task instruction followers."
"Because they are trained on large amounts of textual data, they are also a prominent source of knowledge."
"Despite their success, these models are prone to generate text that is factually incorrect or inconsistent with a provided instruction or knowledge source; such generations are usually referred to as hallucinations."
Quotes
"To systematically quantify the impact of hallucinations in several downstream tasks, we present the Hallucinations Leaderboard, a platform for evaluating the hallucination tendencies of LLMs."
"Our results show variances across models and tasks, offering insights into the strengths and weaknesses of different LLMs in handling hallucinations."
"The Hallucinations Leaderboard represents a significant step towards addressing the challenge of hallucinations in LLMs. It will not only aid researchers and engineers in selecting more reliable models but also drive the development of LLMs."