Core Concepts
ESREAL introduces an unsupervised learning framework to mitigate hallucinations in Vision-Language Models by providing fine-grained negative feedback on hallucinated tokens.
Abstract
Hallucination remains a key challenge for Vision-Language Models, and ESREAL is a novel unsupervised learning framework designed to address it. Its methodology comprises a semantic reconstruction module, an alignment module, a scoring module, and fine-grained PPO. Experiments on three VLMs show significant reductions in hallucination while maintaining or enhancing task performance. Ablation studies and a stability analysis further evaluate the effectiveness and stability of ESREAL.
Directory:
Introduction
Challenges of Hallucinations in Vision-Language Models
Introduction of ESREAL Framework
Methodology
Semantic Reconstruction Module
Alignment Module
Scoring Module
Fine-Grained PPO Algorithm
Experiments and Results
Evaluation Metrics: CHAIR, FaithScore, GPT-4V-Aided Evaluation
Task-Specific Evaluation: CIDEr, ROUGE-L, BLEU Scores
Analysis
Ablation Study on Reward Design
Stability Analysis: Win Rate of Rewards and Aggregating Rewards
Case Analysis: Reward Allocation and Hallucination Mitigation
Limitations of ESREAL
Conclusion
Stats
Our framework notably reduces hallucinations in LLaVA, InstructBLIP, and mPLUG-Owl2 by 32.81%, 27.08%, and 7.46%, respectively, on the CHAIR metric.
We propose ESREAL as a fully unsupervised hallucination mitigation framework.
The holistic reconstruction reward r_rec is defined as r_rec = (sim(I_org, I_rec) + 1) / 2, where sim(I_org, I_rec) is the similarity between the original image I_org and the reconstructed image I_rec, rescaled from [-1, 1] to [0, 1].
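As a minimal sketch of this reward, the snippet below assumes sim is a cosine similarity between image embeddings (e.g. from a CLIP-style encoder, which is an assumption here, not a detail stated above) and rescales it from [-1, 1] to [0, 1]:

```python
import numpy as np

def reconstruction_reward(emb_org: np.ndarray, emb_rec: np.ndarray) -> float:
    """Holistic reconstruction reward r_rec = (sim(I_org, I_rec) + 1) / 2.

    emb_org and emb_rec stand in for embeddings of the original and
    reconstructed images; the choice of cosine similarity is an
    illustrative assumption.
    """
    sim = np.dot(emb_org, emb_rec) / (
        np.linalg.norm(emb_org) * np.linalg.norm(emb_rec)
    )
    # Map cosine similarity from [-1, 1] into the reward range [0, 1].
    return (sim + 1.0) / 2.0

v = np.array([0.2, -0.5, 0.8])
print(reconstruction_reward(v, v))   # identical embeddings: reward close to 1.0
print(reconstruction_reward(v, -v))  # opposite embeddings: reward close to 0.0
```

With this normalization, a perfect reconstruction yields a reward near 1 and a maximally dissimilar one a reward near 0, which keeps the holistic reward on the same [0, 1] scale expected by the fine-grained PPO training loop.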
Quotes
"ESREAL consistently enhances the performance of VLMs on more comprehensive model-based evaluation methods such as FaithScore."
"Our proposed framework achieves a 32.81%, 27.08%, 7.46% improvement in the CHAIR metric over LLaVA, InstructBLIP, and mPLUG-Owl2."