The paper addresses the problem of hallucinations in image captioning, where models generate spurious details that cannot be inferred from the input image. Existing methods have largely focused on closed-vocabulary object lists, ignoring the long-tailed nature of hallucinations in practice.
The authors first introduce OpenCHAIR, a new benchmark for evaluating open-vocabulary object hallucinations in image captioning. OpenCHAIR leverages generative foundation models to produce diverse synthetic caption-image pairs, allowing for a more comprehensive assessment of hallucination types compared to the existing closed-vocabulary CHAIR benchmark.
To mitigate open-vocabulary hallucinations, the authors propose MOCHa, a reinforcement learning-based framework that jointly optimizes for caption fidelity (avoiding hallucinations) and adequacy (including sufficient details). MOCHa uses a multi-objective reward function that combines signals such as natural language inference (NLI) scores and BERTScore, without requiring any strong supervision.
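The idea of trading off fidelity against adequacy in a single scalar reward can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the summary does not give the exact form of the reward, so the weighted sum and the `alpha` parameter are hypothetical, and the two word-overlap scorers are toy stand-ins for an NLI model and BERTScore.

```python
from typing import Callable

# A scorer maps (caption, reference) to a value in [0, 1].
Scorer = Callable[[str, str], float]


def toy_fidelity(caption: str, reference: str) -> float:
    """Toy stand-in for an NLI-based fidelity score (an assumption).

    Returns the fraction of distinct caption words that also appear in
    the reference, so unsupported words lower the score.
    """
    cap = set(caption.lower().split())
    ref = set(reference.lower().split())
    return len(cap & ref) / len(cap) if cap else 0.0


def toy_adequacy(caption: str, reference: str) -> float:
    """Toy stand-in for a BERTScore-like adequacy score (an assumption).

    Returns the fraction of distinct reference words covered by the
    caption, so omitted details lower the score.
    """
    cap = set(caption.lower().split())
    ref = set(reference.lower().split())
    return len(cap & ref) / len(ref) if ref else 0.0


def combined_reward(
    caption: str,
    reference: str,
    fidelity_fn: Scorer = toy_fidelity,
    adequacy_fn: Scorer = toy_adequacy,
    alpha: float = 0.5,  # hypothetical trade-off weight
) -> float:
    """Weighted sum of a fidelity term and an adequacy term."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    fidelity = fidelity_fn(caption, reference)
    adequacy = adequacy_fn(caption, reference)
    return alpha * fidelity + (1.0 - alpha) * adequacy


if __name__ == "__main__":
    reference = "a dog sleeping on a red sofa"
    for caption in ("a dog on a sofa", "a dog and a cat on a sofa"):
        print(caption, "->", round(combined_reward(caption, reference), 3))
```

In a reinforcement learning setup, a reward of this kind would score each sampled caption against a reference; swapping the toy scorers for real NLI and BERTScore models only requires passing different `fidelity_fn` and `adequacy_fn` callables.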
Experiments show that MOCHa improves a variety of state-of-the-art image captioning models, as measured by the OpenCHAIR benchmark and other existing metrics. The authors demonstrate that their open-vocabulary approach outperforms prior work that relies on closed-vocabulary object lists. The paper also provides ablation studies and qualitative examples to illustrate the effectiveness of the proposed framework.
Key Insights Distilled From
by Assaf Ben-Ki... at arxiv.org 04-22-2024
https://arxiv.org/pdf/2312.03631.pdf