Core Concepts
Hallucination in Natural Language Generation (NLG) is a critical issue that has drawn increasing attention, especially with the rapid development of Large Language Models (LLMs). This paper provides a comprehensive survey of the evolution of hallucination evaluation methods, covering the diverse definitions and granularities of facts, the categories of automatic evaluators and their applicability, and unresolved issues and future directions.
Abstract
This paper presents a comprehensive survey of the evolution of hallucination evaluation methods in Natural Language Generation (NLG). It begins by discussing the diverse definitions and granularities of facts, highlighting the distinction between Source Faithfulness (SF) and World Factuality (WF).
The survey then covers evaluation methods developed before the era of Large Language Models (LLMs). These include:
With-reference evaluators: Approaches that leverage reference information, such as Factacc, FactCC, and DAE, to measure factual consistency (a classifier-style sketch follows this list).
Reference-free evaluators: Methods that do not require reference information, such as FEQA, QAGS, and QuestEval, which use question-answering pipelines to assess faithfulness (a QA round-trip sketch also appears after the list).
Datasets and benchmarks: Efforts to create annotated datasets and evaluation frameworks, like XSumFaith, QAGS, and SummEval, to facilitate the assessment of hallucination.
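To make the with-reference recipe concrete, here is a minimal sketch of classifier-based consistency checking in the spirit of FactCC. It is not the paper's implementation: FactCC trains its own BERT-based classifier, and this sketch assumes an off-the-shelf MNLI checkpoint (roberta-large-mnli) as a stand-in.

```python
from transformers import pipeline

# Stand-in for a purpose-trained consistency classifier: FactCC trains its
# own BERT-based model, but an off-the-shelf MNLI model shows the recipe.
nli = pipeline("text-classification", model="roberta-large-mnli", top_k=None)

def consistency_score(source: str, claim: str) -> float:
    """Probability that the source entails the claim, i.e. the FactCC-style
    question: is this generated sentence supported by the input document?"""
    scores = nli({"text": source, "text_pair": claim})
    return {d["label"]: d["score"] for d in scores}["ENTAILMENT"]

# Typical usage: score each summary sentence against the source document and
# flag sentences whose entailment probability falls below a chosen threshold.
```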
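The reference-free QA metrics follow a round-trip idea: if a summary is faithful, questions asked about it should receive the same answers from the summary and from the source document. Below is a hedged sketch of that QAGS/FEQA-style loop; the question-generation checkpoint and its highlight-style prompt format are assumptions, and real implementations extract answer candidates (e.g., named entities) automatically rather than taking them as input.

```python
from collections import Counter
from transformers import pipeline

# Assumed checkpoints: any extractive QA model works (the default here is a
# SQuAD-tuned model); the QG checkpoint and prompt format are assumptions.
qa = pipeline("question-answering")
qg = pipeline("text2text-generation", model="valhalla/t5-base-qg-hl")

def token_f1(a: str, b: str) -> float:
    """SQuAD-style token-overlap F1 between two answer strings."""
    ta, tb = a.lower().split(), b.lower().split()
    overlap = sum((Counter(ta) & Counter(tb)).values())
    if not overlap:
        return 0.0
    p, r = overlap / len(ta), overlap / len(tb)
    return 2 * p * r / (p + r)

def qa_faithfulness(source: str, summary: str, answer_spans: list[str]) -> float:
    """QAGS-style round trip: generate a question for each answer span found
    in the summary, answer it from both summary and source, and average the
    answer agreement. Low agreement signals unfaithful content."""
    scores = []
    for span in answer_spans:  # answer candidates, e.g. entities in the summary
        highlighted = summary.replace(span, f"<hl> {span} <hl>", 1)
        question = qg(f"generate question: {highlighted}")[0]["generated_text"]
        from_summary = qa(question=question, context=summary)["answer"]
        from_source = qa(question=question, context=source)["answer"]
        scores.append(token_f1(from_summary, from_source))
    return sum(scores) / len(scores) if scores else 0.0
```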
The paper then turns to evaluation methods developed after the emergence of LLMs. It discusses:
LLMs as a tool: Approaches that leverage the capabilities of LLMs, such as SCALE, GPTScore, and G-Eval, to evaluate hallucination across various tasks (a prompt-based judging sketch follows this list).
LLMs as the object: Evaluations that focus on assessing the hallucination level of LLMs themselves, including methods such as FacTool, UFO, and SelfCheckGPT, and benchmarks like TruthfulQA, HaluEval, and FACTOR (a sampling-consistency sketch also follows the list).
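As a concrete illustration of the "LLMs as a tool" direction, here is a minimal G-Eval-flavored sketch that simply prompts an LLM to rate factual consistency. The model name is a placeholder, and the real G-Eval additionally generates chain-of-thought evaluation steps and weights the score by output-token probabilities; this is only the bare recipe.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = """You will be given a source document and a summary.
Rate the factual consistency of the summary with the source on a
scale of 1 (inconsistent) to 5 (fully consistent). Reply with the
number only.

Source:
{source}

Summary:
{summary}"""

def llm_judge_consistency(source: str, summary: str,
                          model: str = "gpt-4o-mini") -> int:
    """G-Eval-style sketch: prompt an LLM to score factual consistency.
    The model name is a placeholder; any capable chat model would do."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT.format(source=source, summary=summary)}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())
```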
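On the "LLMs as the object" side, SelfCheckGPT needs no external source at all: it re-samples the model on the same prompt and checks whether each generated sentence is consistent with the samples, on the intuition that facts the model actually knows recur across samples while hallucinations do not. A minimal sketch of its NLI-based variant follows, assuming a generic MNLI checkpoint in place of the DeBERTa NLI model the original method uses.

```python
from transformers import pipeline

# Assumed checkpoint: SelfCheckGPT's NLI variant uses a DeBERTa NLI model;
# any MNLI-style classifier illustrates the recipe.
nli = pipeline("text-classification", model="roberta-large-mnli", top_k=None)

def selfcheck_nli(sentence: str, sampled_responses: list[str]) -> float:
    """SelfCheckGPT-style check: measure how often independently sampled
    responses contradict the sentence under test."""
    contradiction = []
    for sample in sampled_responses:
        scores = {d["label"]: d["score"]
                  for d in nli({"text": sample, "text_pair": sentence})}
        contradiction.append(scores["CONTRADICTION"])
    # Higher mean contradiction => the sentence is more likely hallucinated.
    return sum(contradiction) / len(contradiction)
```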
The survey traces the evolution of hallucination evaluation from task-specific approaches to more comprehensive, LLM-centric methods. It also identifies key challenges and future directions, such as the need for a unified benchmark, better differentiation between hallucinations and other generation errors, and the exploration of hallucination in long-context, multi-lingual, and domain-specific scenarios.
Stats
Hallucination in NLG is like the "elephant in the room": obvious, yet often overlooked until recent achievements significantly improved the fluency and grammatical accuracy of generated text, leaving hallucination as the conspicuous remaining flaw.
Faithfulness and factuality are two closely related concepts for describing hallucinations; the survey distinguishes them as Source Faithfulness (SF), consistency with the input source, and World Factuality (WF), consistency with real-world knowledge.
Fact granularity can be categorized into token/word, span, sentence, and passage levels, and it forms the foundation for assessing hallucination.
Fact error types can be categorized into Source Faithful Error and World Factual Error.
Quotes
"Hallucination in Natural Language Generation (NLG) is like the elephant in the room, obvious but often overlooked until recent achievements significantly improved the fluency and grammatical accuracy of generated text."
"Faithfulness and factuality are two concepts that are closely related when describing hallucinations in the generated text and can be prone to be confusing in some circumstances."