The paper addresses the problem of hallucinations in image captioning, where models generate spurious details that cannot be inferred from the input image. Existing methods have largely focused on closed-vocabulary object lists, ignoring the long-tailed nature of hallucinations in practice.
The authors first introduce OpenCHAIR, a new benchmark for evaluating open-vocabulary object hallucinations in image captioning. OpenCHAIR leverages generative foundation models to produce diverse synthetic caption-image pairs, allowing for a more comprehensive assessment of hallucination types compared to the existing closed-vocabulary CHAIR benchmark.
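Conceptually, an open-vocabulary hallucination metric compares the objects mentioned in a generated caption against those actually present in the image. The sketch below is an illustrative assumption only: it uses exact set membership, whereas OpenCHAIR judges object presence with generative models rather than string matching, and the function name is hypothetical.

```python
def hallucination_rate(predicted_objects: list[str], grounded_objects: set[str]) -> float:
    """Fraction of objects mentioned in a caption that are not grounded in the image.

    Illustrative sketch only: the actual OpenCHAIR benchmark uses generative
    foundation models to judge object presence, not exact string matching.
    """
    if not predicted_objects:
        return 0.0
    hallucinated = [obj for obj in predicted_objects if obj not in grounded_objects]
    return len(hallucinated) / len(predicted_objects)
```

For example, a caption mentioning a dog, a frisbee, and a hat, where the image contains only the dog and the frisbee, yields a rate of 1/3.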
To mitigate open-vocabulary hallucinations, the authors propose MOCHa, a reinforcement learning-based framework that jointly optimizes for caption fidelity (avoiding hallucinations) and adequacy (including sufficient details). MOCHa uses a multi-objective reward function that combines metrics like natural language inference and BERTScore, without requiring any strong supervision.
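A multi-objective reward of this kind can be sketched as a weighted combination of a fidelity score (e.g. an NLI model's entailment probability for the caption given a reference) and an adequacy score (e.g. BERTScore). The weighting scheme, function signatures, and parameter `alpha` below are assumptions for illustration, not the paper's exact formulation.

```python
from typing import Callable

# A scorer maps (caption, reference) to a score in [0, 1].
Scorer = Callable[[str, str], float]

def combined_reward(
    caption: str,
    reference: str,
    fidelity_fn: Scorer,   # stand-in for an NLI entailment scorer (hypothetical)
    adequacy_fn: Scorer,   # stand-in for BERTScore (hypothetical)
    alpha: float = 0.5,    # assumed trade-off weight between the two objectives
) -> float:
    """Weighted reward balancing fidelity (no hallucinations) against
    adequacy (sufficient detail); used as the RL optimization target."""
    return alpha * fidelity_fn(caption, reference) + (1 - alpha) * adequacy_fn(caption, reference)
```

With equal weighting, a caption scoring 1.0 on fidelity and 0.5 on adequacy receives a reward of 0.75; tuning `alpha` trades off terse-but-faithful captions against detailed-but-riskier ones.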
Experiments show that MOCHa improves a variety of state-of-the-art image captioning models, as measured by the OpenCHAIR benchmark and other existing metrics. The authors demonstrate that their open-vocabulary approach outperforms prior approaches that rely on closed-vocabulary object lists. The paper also provides ablation studies and qualitative examples illustrating the effectiveness of the proposed framework.