Core Concepts
Multimodal large language models (MLLMs) often generate outputs that are inconsistent with the visual content, a challenge known as hallucination, which poses substantial obstacles to their practical deployment and raises concerns regarding their reliability in real-world applications.
Abstract
This survey presents a comprehensive analysis of the phenomenon of hallucination in multimodal large language models (MLLMs), also known as Large Vision-Language Models (LVLMs). Despite the significant advancements and remarkable abilities of MLLMs in multimodal tasks, they often generate outputs that are inconsistent with the visual content, a challenge known as hallucination.
The key highlights and insights are:
Causes of Hallucination:
Data-related: Insufficient data quantity, noisy or low-quality data, and statistical biases in the training data can lead to hallucinations.
Model-related: Weak vision encoders, over-reliance on language model priors, and inferior cross-modal alignment interfaces can contribute to hallucinations.
Training-related: Suboptimal training objectives and the lack of reinforcement learning from human feedback can result in hallucinations.
Inference-related: As the output sequence grows during auto-regressive generation, attention to visual tokens is progressively diluted, which can lead to hallucinations.
Hallucination Evaluation:
Metrics: CHAIR, POPE, AMBER, and various LLM-based evaluators are designed to assess different facets of hallucination, including object category, attribute, and relation (a minimal CHAIR computation is sketched after this list).
Benchmarks: Diverse datasets and evaluation tasks, such as discriminative and generative benchmarks, have been developed to assess hallucinations in MLLMs.
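As a concrete illustration of these metrics, CHAIR counts how many of the objects mentioned in a generated caption are absent from the image, at both the instance and the sentence level. The sketch below is a minimal, simplified version that assumes mentioned objects have already been extracted and normalized to a fixed vocabulary; the original metric additionally handles synonym mapping and COCO-specific annotations.

```python
# Minimal sketch of the CHAIR metric, assuming object mentions are pre-extracted.
def chair_scores(captions_objects, gt_objects):
    """captions_objects: list of per-caption object lists extracted from model outputs.
    gt_objects: set of objects actually present in the image (e.g., from annotations)."""
    hallucinated = 0           # mentioned objects not present in the image
    total_mentions = 0         # all mentioned object instances
    hallucinated_captions = 0  # captions with at least one hallucinated object

    for objs in captions_objects:
        caption_has_hallucination = False
        for obj in objs:
            total_mentions += 1
            if obj not in gt_objects:
                hallucinated += 1
                caption_has_hallucination = True
        if caption_has_hallucination:
            hallucinated_captions += 1

    chair_i = hallucinated / max(total_mentions, 1)                   # instance-level
    chair_s = hallucinated_captions / max(len(captions_objects), 1)   # sentence-level
    return chair_i, chair_s

# Example: two captions for an image that contains only {"dog", "frisbee", "grass"}
captions = [["dog", "frisbee"], ["dog", "ball", "person"]]
print(chair_scores(captions, {"dog", "frisbee", "grass"}))  # (0.4, 0.5)
```

Lower values indicate fewer hallucinated object mentions; discriminative benchmarks such as POPE instead pose yes/no questions about object presence and report accuracy or F1.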
Hallucination Mitigation:
Data-related: Introducing negative data, counterfactual data, and denoising techniques can help mitigate data-related hallucinations.
Model-related: Scaling up input image resolution, using more versatile vision encoders, and adding dedicated modules can address model-related hallucinations.
Training-related: Auxiliary supervision, contrastive loss, and reinforcement learning can help mitigate training-related hallucinations.
Inference-related: Generation intervention techniques, such as contrastive decoding and guided decoding, as well as post-hoc correction methods, can address inference-related hallucinations; a contrastive-decoding sketch follows below.
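To make the contrastive-decoding idea concrete, here is a hedged sketch in the spirit of visual contrastive decoding: the next-token logits conditioned on the original image are contrasted with logits conditioned on a distorted image, so that tokens driven purely by language priors are down-weighted. The model interface (`input_ids`, `images`) and the hyperparameter `alpha` are illustrative assumptions, not an exact reproduction of any specific method described in the survey.

```python
import torch

def contrastive_decode_step(model, input_ids, image, noisy_image, alpha=1.0):
    """One greedy decoding step that contrasts logits from the original image
    with logits from a distorted (e.g., heavily noised) copy of the image."""
    with torch.no_grad():
        logits_orig = model(input_ids=input_ids, images=image).logits[:, -1, :]
        logits_noisy = model(input_ids=input_ids, images=noisy_image).logits[:, -1, :]
    # Tokens favored regardless of the visual input (language priors) score similarly
    # in both passes and are suppressed; visually grounded tokens are amplified.
    contrastive_logits = (1 + alpha) * logits_orig - alpha * logits_noisy
    return contrastive_logits.argmax(dim=-1)  # greedy choice of the next token
```

In practice such methods also restrict the contrast to plausible candidate tokens (an adaptive plausibility constraint) so that low-probability tokens are not accidentally promoted.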
The survey aims to deepen the understanding of hallucinations in MLLMs and inspire further advancements in the field, contributing to the ongoing dialogue on enhancing the robustness and reliability of MLLMs.
Stats
"Recently, the emergence of large language models (LLMs) has dominated a wide range of tasks in natural language processing (NLP), achieving unprecedented progress in language understanding, generation and reasoning."
"MLLMs show promising ability in multimodal tasks, such as image captioning, visual question answering, etc."
"MLLMs often generate outputs that are inconsistent with the visual content, a challenge known as hallucination, which poses substantial obstacles to their practical deployment and raises concerns regarding their reliability in real-world applications."
Quotes
"The problem of hallucination originates from LLMs themselves. In the NLP community, the hallucination problem is empirically categorized into two types: 1) factuality hallucination emphasizes the discrepancy between generated content and verifiable real-world facts, typically manifesting as factual inconsistency or fabrication; 2) faithfulness hallucination refers to the divergence of generated content from user instructions or the context provided by the input, as well as self-consistency within generated content."
"In contrast to pure LLMs, research efforts of hallucination in MLLMs mainly focus on the discrepancy between generated text response and provided visual content, i.e., cross-modal inconsistency."