Core Concepts
Hallucinations in large vision-language models can be effectively mitigated by generating targeted instruction data that accounts for the hallucination specificity of different models.
Abstract
The paper investigates the hallucination specificity of large vision-language models (LVLMs): different models exhibit different hallucination patterns for the same image. The authors find that existing instruction tuning datasets, such as LRV-Instruction, do not account for this specificity, which reduces their effectiveness in mitigating model hallucinations.
To address this issue, the authors propose the DFTG (Diagnose First, Then Generate) framework. DFTG first diagnoses the hallucinations of a given LVLM on an image by extracting textual and visual information, and then generates targeted instruction data based on the diagnostic results. Experiments on hallucination benchmarks show that the targeted instruction data generated by DFTG mitigate hallucinations more effectively than previous datasets.
The key steps of the DFTG framework are:
Caption generation: Obtain the model's description of the image to understand its perception.
Text information extraction: Extract key objects, attributes, and quantities mentioned in the description.
Image information extraction: Detect the actual objects, attributes, and quantities present in the image using an open-vocabulary object detection model.
Hallucination checking: Compare the textual and visual information to identify hallucinations.
Targeted instruction data generation: Generate positive and negative instruction samples based on the diagnostic results, covering object existence, attributes, position, and relations.
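The checking and generation steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the caption objects and the detector output are already available as plain lists of object names, it covers only object existence (not attributes, position, or relations), and the names Diagnosis, diagnose, and generate_instructions, the question template, and the sample inputs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnosis:
    """Result of comparing caption objects with detected objects (hypothetical structure)."""
    hallucinated: set  # mentioned in the caption, not found by the detector
    confirmed: set     # mentioned in the caption and found by the detector
    missed: set        # found by the detector, not mentioned in the caption


def diagnose(caption_objects, detected_objects):
    """Hallucination checking: compare textual and visual object sets."""
    mentioned = {name.lower() for name in caption_objects}
    present = {name.lower() for name in detected_objects}
    return Diagnosis(
        hallucinated=mentioned - present,
        confirmed=mentioned & present,
        missed=present - mentioned,
    )


def generate_instructions(diagnosis):
    """Targeted data generation: build yes/no existence samples.

    Assumption: a positive sample asks about an object that is really in
    the image, and a negative sample asks about an object the model
    hallucinated. The paper's exact sample format may differ.
    """
    template = "Is the following object present in the image: {}?"
    samples = []
    for name in sorted(diagnosis.confirmed | diagnosis.missed):
        samples.append({"question": template.format(name), "answer": "Yes", "label": "positive"})
    for name in sorted(diagnosis.hallucinated):
        samples.append({"question": template.format(name), "answer": "No", "label": "negative"})
    return samples


if __name__ == "__main__":
    # Hypothetical inputs: objects extracted from a model caption, and
    # objects returned by an open-vocabulary detector for the same image.
    caption_objects = ["airplane", "person", "truck"]
    detected_objects = ["airplane", "truck", "runway"]

    result = diagnose(caption_objects, detected_objects)
    for sample in generate_instructions(result):
        print(sample)
```

Because the diagnosis is computed per model and per image, two models that caption the same image differently would receive different negative samples, which is the sense in which the generated data are targeted.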
Results on the POPE, MME, AMBER, and VHTest benchmarks show that models fine-tuned with the targeted instruction data generated by DFTG mitigate hallucinations better than both the original models and models fine-tuned with existing instruction datasets.
Stats
The scene features a small, white airplane parked on an airport runway.
There is a person visible near the airplane, possibly an airport worker or a pilot.
In addition to the airplane and person, there is a truck parked in the background.