Core Concepts
MLLMs may not be entirely oblivious to accurate visual concepts when they hallucinate; Pensieve builds on this insight, mitigating visual hallucination by retrospectively comparing the test image with visually similar reference images.
Abstract
The article introduces Pensieve, a training-free method for mitigating visual hallucination in Multi-modal Large Language Models (MLLMs). Observing that MLLMs often produce inaccurate image descriptions, it proposes a paradigm in which the model retrospects visually similar images and compares them with the test image. This comparison downgrades hallucinatory content and makes image descriptions more specific. Experiments on multiple benchmarks demonstrate the efficacy of Pensieve in mitigating visual hallucination and improving model performance.
Introduction
MLLMs dominate vision-language tasks but suffer from visual hallucinations.
The proposed Pensieve method mitigates visual hallucination via retrospective comparison with visually similar images.
Delve into Visual Hallucination
Origins of visual hallucination and the flaws within MLLMs that give rise to it are discussed.
A key observation is that MLLMs might not be completely blind to accurate visual cues while hallucinating; instead, they appear to be deceived by visually similar distractors.
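To make this observation concrete, below is a minimal sketch of how one might inspect a model's next-token distribution during a hallucinated description; the helper name, the assumed Hugging Face-style tokenizer, and the top-k value are illustrative assumptions, not details from the paper.

```python
import torch

def top_candidates(logits: torch.Tensor, tokenizer, k: int = 10):
    """List the k most likely next tokens with their probabilities.

    `logits` is assumed to be the (vocab_size,)-shaped next-token
    logits an MLLM produced while describing the test image, and
    `tokenizer` a Hugging Face-style tokenizer (assumption).
    """
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, k)
    # Decoding each candidate lets a human check whether the accurate
    # visual concept still appears among the top-ranked tokens even
    # when the greedily chosen token is hallucinatory.
    return [(tokenizer.decode([idx]), prob.item())
            for idx, prob in zip(top.indices.tolist(), top.values)]
```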
Methodology
Visually similar images are retrieved and used as references for the test image.
The model's predictions on the test image are then contrasted against its predictions on the references, separating accurate visual content from candidates induced by shared visual distractors (see the sketch below).
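The following is a minimal sketch of this retrieve-then-compare idea, assuming the comparison works like contrastive decoding over next-token logits; the function names, the CLIP-style retrieval backbone, the uniform averaging of references, and the `alpha` knob are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def retrieve_references(test_feat: torch.Tensor,
                        gallery_feats: torch.Tensor,
                        k: int = 4) -> torch.Tensor:
    """Pick the k gallery images most similar to the test image.

    Assumes precomputed image embeddings, e.g. from a CLIP-style
    encoder (an assumption; the retrieval setup is not fixed here).
    """
    sims = F.cosine_similarity(test_feat.unsqueeze(0), gallery_feats, dim=-1)
    return torch.topk(sims, k).indices

def contrast_logits(test_logits: torch.Tensor,
                    ref_logits: list[torch.Tensor],
                    alpha: float = 1.0) -> torch.Tensor:
    """Downgrade candidates shared with visually similar references.

    Tokens that score high for the references as well as for the test
    image are treated as visually deceptive; subtracting the reference
    average suppresses them while boosting content specific to the
    test image. `alpha` is a hypothetical contrast strength.
    """
    ref_mean = torch.stack(ref_logits).mean(dim=0)
    return test_logits + alpha * (test_logits - ref_mean)
```

At each decoding step, the adjusted logits would replace the raw ones before sampling or argmax, so hallucinatory candidates lose probability mass while image-specific details gain it.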
Experiments
Evaluation on image captioning benchmarks (Whoops, LLaVA Bench) shows improvements with Pensieve.
Results on binary VQA benchmarks (MME, POPE) demonstrate reduced visual hallucination.
Stats
The Pensieve method is proposed to address visual hallucination.
Pensieve's effectiveness is demonstrated across many benchmarks.
Quotes
"MLLMs might not be entirely oblivious to accurate visual cues when they hallucinate."
"Our investigation suggests that the MLLMs might not be entirely oblivious to accurate visual cues when they hallucinate; rather, they could be deceived by their eyes."