Core Concepts
Detailed image captioning in large multimodal models (LMMs) suffers from object existence hallucination, where the model references objects not present in the image. This work analyzes the underlying causes of such hallucination and introduces HallE-Control, a controllable LMM that can shift between exclusively depicting contextually grounded objects and blending them with parametrically inferred objects.
Abstract
The paper investigates the problem of object existence hallucination in detailed image captioning using large multimodal models (LMMs). It first outlines three types of hallucination: object existence, attribute, and relationship hallucination.
To better evaluate object existence hallucination, the authors introduce CCEval, a GPT-4 assisted evaluation method that keeps factors such as average sentence length and number of objects consistent, so that hallucination scores stay comparable. CCEval reveals that even models that perform well on VQA-based benchmarks exhibit substantial hallucinations in detailed captions.
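As a rough illustration of the kind of object-level measurement such an evaluation relies on, the sketch below scores one caption against a list of objects known to be in the image. It is not the CCEval implementation: the function name, the set-based matching, and the two ratios are assumptions made for this example, and the step where GPT-4 extracts and matches object mentions is replaced by plain Python lists.

```python
def caption_object_metrics(mentioned, ground_truth):
    """Score one caption's object mentions against the objects in the image.

    Hypothetical helper for illustration only. Both arguments are iterables
    of object names that are assumed to be already normalized.
    """
    mentioned = set(mentioned)
    ground_truth = set(ground_truth)

    # Objects the caption names that are not in the image.
    hallucinated = mentioned - ground_truth
    # Objects in the image that the caption names correctly.
    covered = mentioned & ground_truth

    hallucination_rate = len(hallucinated) / len(mentioned) if mentioned else 0.0
    coverage = len(covered) / len(ground_truth) if ground_truth else 0.0

    return {
        "hallucinated": sorted(hallucinated),
        "hallucination_rate": hallucination_rate,
        "coverage": coverage,
        "num_mentioned": len(mentioned),
    }


# Toy input, not data from the paper.
scores = caption_object_metrics(
    mentioned=["car", "bus", "tree", "dog"],
    ground_truth=["car", "bus", "tree", "cable"],
)
print(scores)
```

Reporting the number of mentioned objects next to the hallucination rate matters because a model can lower its rate simply by naming fewer objects, which is why caption length and object count need to be kept comparable.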
The paper then systematically analyzes the factors influencing hallucination, including the size of the language decoder, the quantity and quality of instruction data, and the input resolution to the vision encoder. The key finding is that the misalignment between objects mentioned in the training captions and those that the vision encoder can effectively ground is the primary cause of hallucination.
To address this issue, the authors present HallE-Control, a novel approach that controls how much hallucination, or parametric knowledge, the model expresses. HallE-Control is trained on a dataset that captures both pure contextual knowledge and a blend of contextual and parametric knowledge. During inference, adjusting a single continuous parameter lets the model produce detailed captions containing only contextually grounded objects or a blend of contextually grounded and parametrically inferred objects. This method reduces hallucination by 44% compared to the baseline while maintaining object coverage.
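One way to picture control through a single continuous parameter is a learned shift that is scaled before being added to a hidden representation. The sketch below is a minimal illustration under that assumption, not the HallE-Control architecture: the additive form, the range [-1, 1], and the names `controlled_hidden`, `control_weights`, and `epsilon` are all hypothetical.

```python
def controlled_hidden(hidden, control_weights, epsilon):
    """Return hidden + epsilon * (control_weights @ hidden).

    Hypothetical illustration: `hidden` is a list of floats, `control_weights`
    is a square matrix given as a list of rows, and `epsilon` is the single
    control value. The range check below is an assumption of this sketch.
    """
    if not -1.0 <= epsilon <= 1.0:
        raise ValueError("epsilon is assumed to lie in [-1, 1]")

    # Matrix-vector product written out with plain lists.
    shift = [sum(w * h for w, h in zip(row, hidden)) for row in control_weights]

    # Scale the shift by epsilon and add it to the original representation.
    return [h + epsilon * s for h, s in zip(hidden, shift)]


hidden = [1.0, 2.0]
weights = [[0.5, 0.0], [0.0, 0.5]]

for eps in (-1.0, 0.0, 1.0):
    print(eps, controlled_hidden(hidden, weights, eps))
```

In this toy setup, `epsilon = 0.0` leaves the representation unchanged and the two extremes push it in opposite directions, which mirrors the idea of moving between a grounded-only mode and a mode that also draws on parametric knowledge.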
Stats
The image displays a bustling street scene with an old-fashioned car and a white bus.
In the background, a series of trees and overhead cables are visible, suggesting an urban setting.