Core Concept
Investigating and addressing hallucinations in Vision-Language Models through visual prompt dependency.
Abstract
The paper investigates hallucinations in Vision-Language Models (VLMs) that arise from an over-reliance on language priors at the expense of the visual prompt. It introduces Multi-Modal Mutual Information Decoding (M3ID), which reduces hallucinations by amplifying the influence of the reference image during decoding. Empirical results support the effectiveness of M3ID in reducing ungrounded answers while maintaining linguistic fluency.
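As a rough illustration of the decoding idea (not the paper's exact formulation), the sketch below contrasts the next-token log-probabilities computed with and without the image, boosting tokens that the visual evidence supports over those explained by the language prior alone. The function names, the exponential weighting schedule, and the hyperparameter `lam` are assumptions for illustration only.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the vocabulary axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def m3id_style_step(logits_with_image, logits_text_only, step, lam=0.2):
    """One greedy decoding step of a mutual-information-style contrast.

    Tokens whose probability rises when the image is present are boosted;
    tokens driven purely by the language prior are penalized. The weight
    grows with the step index, mimicking the intuition that visual
    conditioning dilutes as the generated text gets longer.
    """
    logp_cond = log_softmax(np.asarray(logits_with_image, dtype=float))    # log p(y | image, prefix)
    logp_uncond = log_softmax(np.asarray(logits_text_only, dtype=float))   # log p(y | prefix)
    gamma = 1.0 - np.exp(-lam * step)                                      # assumed schedule, grows over time
    scores = logp_cond + gamma * (logp_cond - logp_uncond)
    return int(np.argmax(scores))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tok = m3id_style_step(rng.normal(size=32), rng.normal(size=32), step=5)
    print("chosen token id:", tok)
```

In practice the two logit vectors would come from the same VLM queried with and without the image; random vectors are used here only to keep the example self-contained.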
Directory:
- Abstract:
- Investigates hallucinations in VLMs due to reliance on language priors over visual prompts.
  - Introduces M3ID to reduce hallucinations by amplifying the influence of reference images.
- Introduction:
  - Discusses autoregressive VLMs' remarkable multimodal capabilities alongside their susceptibility to hallucinations.
- Proposes investigating hallucinations through a quantifiable measure of visual prompt dependency.
- Data Extraction:
- "Specifically, for the LLaVA 13B model, M3ID and M3ID+DPO reduce the percentage of hallucinated objects in captioning tasks by 25% and 28%, respectively, and improve the accuracy on VQA benchmarks such as POPE by 21% and 24%."
- Related Work:
- Mentions previous works on VLMs' tendency to produce ungrounded information known as "hallucinations."
- Discusses decoding algorithms like search or sampling methods used to enhance reasoning and factual accuracy.
- Analysis of Hallucinations in VLMs:
- Introduces a visual prompt dependency measure (PDM) to assess whether model outputs are grounded with respect to visual input.
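One plausible way to instantiate such a dependency measure is to compare the model's next-token distribution computed with the visual prompt against the one computed from text alone; the sketch below uses the Hellinger distance for this comparison, but the choice of divergence and the function names are assumptions rather than the paper's exact definition.

```python
import numpy as np

def softmax(logits):
    """Convert a logit vector into a probability distribution."""
    z = np.exp(logits - logits.max())
    return z / z.sum()

def visual_prompt_dependency(logits_with_image, logits_text_only):
    """Hedged sketch of a visual prompt dependency measure (PDM).

    A value near 0 means the image barely changed the prediction, i.e. the
    token is driven by the language prior and is likely ungrounded; larger
    values indicate predictions that depend on the visual input.
    """
    p = softmax(np.asarray(logits_with_image, dtype=float))
    q = softmax(np.asarray(logits_text_only, dtype=float))
    # Hellinger distance between the two next-token distributions (in [0, 1]).
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))
```

Tracking this quantity token by token gives a simple diagnostic for when generation drifts away from the image, which is the condition the decoding intervention above is meant to counteract.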
Statistics
Specifically, for the LLaVA 13B model, M3ID and M3ID+DPO reduce the percentage of hallucinated objects in captioning tasks by 25% and 28%, respectively, and improve the accuracy on VQA benchmarks such as POPE by 21% and 24%.