Unveiling the Problem of Inconsistent Knowledge Between Vision and Language in Large Vision-Language Models
Large Vision-Language Models (LVLMs) suffer from cross-modality knowledge conflicts: inconsistencies between their separately trained vision and language components that can yield contradictory answers to the same question. This motivates targeted interventions, such as dynamic contrastive decoding, to improve their reliability.
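Dynamic contrastive decoding is only named above, not specified. As a rough illustration of the general idea, the sketch below contrasts the model's next-token logits when conditioned on the image against its logits from the text-only prompt, amplifying the former and suppressing the latter; the function name, the disagreement-based weighting, and the `max_alpha` parameter are all illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def dynamic_contrastive_decoding(vl_logits: torch.Tensor,
                                 text_logits: torch.Tensor,
                                 max_alpha: float = 1.0) -> torch.Tensor:
    """Contrast vision-conditioned logits against text-only logits.

    vl_logits:   (batch, vocab) logits with the image in the prompt.
    text_logits: (batch, vocab) logits from the text-only prompt.
    Returns adjusted logits for next-token selection.
    Hypothetical sketch; the dynamic weighting scheme is an assumption.
    """
    p_vl = F.softmax(vl_logits, dim=-1)
    p_txt = F.softmax(text_logits, dim=-1)

    # Dynamic weight: grow the contrastive strength when the two
    # modalities disagree more (total variation distance in [0, 1]).
    disagreement = 0.5 * (p_vl - p_txt).abs().sum(dim=-1)
    alpha = max_alpha * disagreement

    # Contrastive combination: amplify image-grounded evidence,
    # penalize the language-only prior that conflicts with it.
    return (1 + alpha).unsqueeze(-1) * vl_logits - alpha.unsqueeze(-1) * text_logits

# Usage: pick the next token from the adjusted distribution.
vl_logits = torch.randn(1, 32000)    # stand-in for a real LVLM forward pass
text_logits = torch.randn(1, 32000)  # stand-in for the text-only forward pass
next_token = dynamic_contrastive_decoding(vl_logits, text_logits).argmax(dim=-1)
```

The design choice here is that the contrast strength adapts per step: when vision- and text-conditioned distributions agree, alpha shrinks toward zero and decoding is unchanged, so the penalty only activates where a cross-modality conflict actually appears.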