
Evaluating Hallucination in Natural Language Generation: A Comprehensive Survey on the Evolvement of Evaluation Methods


Core Concepts
Hallucination in Natural Language Generation (NLG) is a critical issue that has gained increasing attention, especially with the rapid development of Large Language Models (LLMs). This paper provides a comprehensive survey on the evolvement of hallucination evaluation methods, covering diverse definitions and granularity of facts, the categories of automatic evaluators and their applicability, as well as unresolved issues and future directions.
Abstract
This paper presents a comprehensive survey on the evolvement of hallucination evaluation methods in Natural Language Generation (NLG). It starts by discussing the diverse definitions and granularity of facts, highlighting the distinction between Source Faithfulness (SF) and World Factuality (WF).

The survey then covers the evaluation methods before the era of Large Language Models (LLMs). These include:
With-reference evaluators: approaches that leverage reference information, such as Factacc, FactCC, and DAE, to measure factual consistency.
Reference-free evaluators: methods that do not require reference information, such as FEQA, QAGS, and QuestEval, which use question-answering pipelines to assess faithfulness (a minimal sketch of this pipeline follows the abstract).
Datasets and benchmarks: efforts to create annotated datasets and evaluation frameworks, like XSumFaith, QAGS, and SummEval, to facilitate the assessment of hallucination.

The paper then delves into the evaluation methods after the emergence of LLMs. It discusses:
LLMs as a tool: approaches that leverage the capabilities of LLMs, such as SCALE, GPTScore, and G-Eval, to evaluate hallucination across various tasks.
LLMs as the object: evaluations that focus on assessing the hallucination level of LLMs themselves, including FacTool, UFO, SelfCheckGPT, and benchmarks like TruthfulQA, HaluEval, and FACTOR.

The survey highlights the evolvement of hallucination evaluation, from task-specific approaches to more comprehensive and LLM-centric methods. It also identifies key challenges and future directions, such as the need for a unified benchmark, better differentiation between hallucination and errors, and the exploration of hallucination in long-context, multi-lingual, and domain-specific scenarios.
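The reference-free evaluators listed above (FEQA, QAGS, QuestEval) share a common recipe: generate questions from the generated summary, answer each question once using the summary and once using only the source, and treat disagreement between the two answers as a sign of hallucination. The snippet below is a minimal sketch of the final answer-comparison step only; the question-generation and question-answering components are left as pluggable stand-ins, so this illustrates the general idea rather than the exact implementation of any of these evaluators.

```python
# Sketch of the answer-comparison step in a QA-based faithfulness pipeline
# (FEQA/QAGS-style). QG and QA models are assumed to be supplied by the
# caller; only the SQuAD-style token F1 used to compare answers is shown.
from collections import Counter


def token_f1(pred: str, gold: str) -> float:
    """Token-overlap F1 between two answer strings."""
    pred_tokens, gold_tokens = pred.lower().split(), gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def qa_faithfulness_score(qa_pairs_from_summary, answer_from_source) -> float:
    """Average agreement between summary-grounded and source-grounded answers.

    qa_pairs_from_summary: (question, answer_in_summary) pairs produced by a
        question-generation model run over the summary (assumed).
    answer_from_source: callable answering a question from the source
        document, e.g. an extractive QA model (assumed).
    Low agreement indicates summary content the source does not support,
    i.e. likely hallucination.
    """
    if not qa_pairs_from_summary:
        return 0.0
    scores = [token_f1(answer_from_source(q), a) for q, a in qa_pairs_from_summary]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy stand-ins for the QG and QA components.
    qa_pairs = [("Who acquired the startup?", "Acme Corp")]
    answer_from_source = lambda q: "Acme Corp"
    print(qa_faithfulness_score(qa_pairs, answer_from_source))  # 1.0 -> consistent
```

Evaluators in this family differ mainly in how questions are selected and weighted and in whether the comparison runs in both directions between source and summary.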
Stats
Hallucination in NLG is like the "elephant in the room": obvious but often overlooked until recent achievements significantly improved the fluency and grammatical accuracy of generated text.
Faithfulness and factuality are two closely related concepts for describing hallucinations, with Source Faithfulness (SF) and World Factuality (WF) as the key distinction.
Fact granularity can be categorized into token/word, span, sentence, and passage levels, which serve as the foundational basis for assessing hallucination.
Fact error types can be categorized into Source Faithful Error and World Factual Error.
Quotes
"Hallucination in Natural Language Generation (NLG) is like the elephant in the room, obvious but often overlooked until recent achievements significantly improved the fluency and grammatical accuracy of generated text." "Faithfulness and factuality are two concepts that are closely related when describing hallucinations in the generated text and can be prone to be confusing in some circumstances."

Deeper Inquiries

How can hallucination evaluation methods be further improved to better differentiate between hallucination and other types of errors in generated text?

Hallucination evaluation methods can be improved by analyzing the underlying causes of hallucinations rather than only their surface symptoms. Examining the factors that contribute to hallucinations, such as fact granularity and the model's reasoning abilities, helps evaluators distinguish hallucinations from other types of errors in generated text. Leveraging interpretability tools to inspect the internal state of the model during generation can provide further insight into the mechanisms behind hallucinations. Finally, evaluation frameworks that account for the context and task-specific criteria can identify genuine hallucinations more reliably, sharpening the distinction between hallucinations and ordinary errors.

How can hallucination evaluation be extended to address long-context generation, multi-lingual scenarios, and emerging applications of LLMs beyond text-only generation?

For long-context generation, evaluators can focus on hallucinations that arise from the model's difficulty in understanding lengthy inputs and maintaining coherence across long outputs. Breaking long texts into fine-grained atomic facts and checking consistency at different levels of granularity makes hallucination evaluation tractable in these scenarios (a minimal sketch of this decompose-and-verify idea follows this answer).

In multi-lingual scenarios, evaluators can build benchmarks and datasets for languages other than English so that hallucination evaluation is comprehensive and inclusive across diverse linguistic contexts. Accounting for cultural nuances and language-specific challenges allows existing evaluation methods to be adapted to multi-lingual settings.

For emerging applications of LLMs beyond text-only generation, such as autonomous agents, multimodality, and other real-world deployments, new types of hallucination may arise. Evaluating them will require approaches that consider factors specific to each application domain, including specialized benchmarks and evaluation frameworks tailored to these evolving uses of LLMs.
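As a concrete illustration of the decompose-and-verify idea from the first paragraph, the sketch below splits a long generation into sentence-level atomic facts and checks each one against the source with a pluggable support checker. The sentence splitter and the keyword-overlap checker are deliberately naive stand-ins included only for illustration; in practice an NLI model or an LLM-based verifier would take their place.

```python
# Sketch: fine-grained consistency checking for long outputs. The generation
# is split into sentence-level "atomic facts" and each fact is verified
# against the source by a pluggable checker.
import re
from typing import Callable, List, Tuple


def split_into_atomic_facts(text: str) -> List[str]:
    """Naive sentence splitter standing in for a real fact extractor."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def fine_grained_consistency(
    source: str,
    generation: str,
    is_supported: Callable[[str, str], bool],
) -> Tuple[float, List[str]]:
    """Return the fraction of atomic facts supported by the source and the
    list of unsupported facts (candidate hallucinations)."""
    facts = split_into_atomic_facts(generation)
    unsupported = [f for f in facts if not is_supported(source, f)]
    score = 1.0 - len(unsupported) / max(len(facts), 1)
    return score, unsupported


if __name__ == "__main__":
    source = "The bridge opened in 1937. It spans the Golden Gate strait."
    generation = "The bridge opened in 1937. It was designed by a famous poet."
    # Toy checker: a fact counts as supported if its first few words appear
    # in the source; a real system would use entailment or LLM verification.
    checker = lambda src, fact: all(w in src.lower() for w in fact.lower().split()[:3])
    print(fine_grained_consistency(source, generation, checker))
    # -> (0.5, ['It was designed by a famous poet.'])
```

The same fraction-of-supported-facts score carries over to multi-lingual and domain-specific settings once the splitter and checker are swapped for language- or domain-appropriate components.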