The article introduces the Hallucinations Leaderboard, an open initiative to evaluate the hallucination tendencies of large language models (LLMs) across various tasks and metrics. The leaderboard covers a range of tasks, including closed-book open-domain question answering, summarization, reading comprehension, instruction following, fact-checking, and hallucination detection. These tasks are categorized into two classes: factuality hallucination and faithfulness hallucination.
The factuality evaluation assesses the LLM's ability to generate factually correct content, while the faithfulness evaluation examines the LLM's capability to generate content that adheres to the given source of information. The leaderboard evaluates 20 LLMs across 15 tasks, with each model assessed in a zero- or very few-shot in-context learning setting.
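To make the zero-/few-shot in-context setting concrete, here is a minimal sketch (not the leaderboard's actual evaluation harness) of probing a model for factual hallucinations under zero-shot versus few-shot prompting; the model name, prompts, and gold answer are illustrative assumptions.

```python
# Minimal sketch, assuming the Hugging Face transformers library is installed.
# Not the leaderboard's own pipeline: model, prompts, and gold answer are illustrative.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Zero-shot: the model must answer from its parametric knowledge alone.
zero_shot_prompt = "Q: What is the capital of Australia?\nA:"

# Few-shot: a handful of in-context examples demonstrate the task format.
few_shot_prompt = (
    "Q: What is the capital of France?\nA: Paris\n"
    "Q: What is the capital of Japan?\nA: Tokyo\n"
    "Q: What is the capital of Australia?\nA:"
)

for name, prompt in [("zero-shot", zero_shot_prompt), ("few-shot", few_shot_prompt)]:
    completion = generator(prompt, max_new_tokens=10, do_sample=False)[0]["generated_text"]
    answer = completion[len(prompt):].strip().split("\n")[0]
    # A closed-book factuality check compares the answer to a gold reference
    # ("Canberra" here); a mismatch is scored as a factual hallucination.
    print(f"{name}: {answer}")
```

A faithfulness task would instead prepend a source document to the prompt and check whether the generation stays consistent with that document rather than with world knowledge.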
The results show considerable variation across models and tasks, offering insight into the strengths and weaknesses of different LLMs in handling hallucinations. The authors observe that LLMs are better at judging factuality and faithfulness than at producing factual and faithful generations. Hallucination tendency appears to depend more on the model family than on the individual model within that family. The impact of instruction fine-tuning and model size on hallucinations is also analyzed, revealing a potential trade-off between faithfulness and factuality.
The Hallucinations Leaderboard represents a significant step towards addressing the challenge of hallucinations in LLMs, aiding researchers and engineers in selecting more reliable models and driving the development of LLMs with improved capabilities.