Core Concepts
Vision language models often struggle to identify when a given visual question is unsolvable, leading to unreliable responses. This paper introduces the Unsolvable Problem Detection (UPD) challenge to assess a model's ability to withhold answers when faced with missing correct options, irrelevant answer sets, or incompatible image-question pairs.
Abstract
This paper introduces the Unsolvable Problem Detection (UPD) challenge, which evaluates the ability of vision language models (VLMs) to recognize and refrain from answering unsolvable problems in the context of visual question answering (VQA) tasks.
The UPD challenge encompasses three distinct settings (a toy illustration of each follows the list):
Absent Answer Detection (AAD): Evaluating the model's ability to recognize when the correct answer is not present in the provided options.
Incompatible Answer Set Detection (IASD): Assessing the model's capacity to identify situations where the answer choices are entirely irrelevant to the given image and question.
Incompatible Visual Question Detection (IVQD): Testing the model's understanding of the alignment between visual content and textual questions, and its ability to spot instances where the image and question are incompatible.
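To make the three settings concrete, here is a minimal, hypothetical sketch of what an evaluation item could look like in each setting. The field names and examples are purely illustrative and do not reflect the actual schema of the MM-AAD, MM-IASD, or MM-IVQD benchmarks.

```python
# Hypothetical illustrations of the three UPD settings (illustrative only;
# not the actual benchmark data or schema).
upd_examples = {
    # AAD: the image shows a cat, but "cat" is missing from the options,
    # so the model should withhold an answer rather than guess.
    "AAD": {
        "image": "photo_of_a_cat.jpg",
        "question": "What animal is shown in the image?",
        "options": ["A. dog", "B. horse", "C. rabbit", "D. bird"],
    },
    # IASD: the answer set belongs to a completely different question,
    # so no option is even relevant to the image-question pair.
    "IASD": {
        "image": "photo_of_a_cat.jpg",
        "question": "What animal is shown in the image?",
        "options": ["A. Monday", "B. Tuesday", "C. Friday", "D. Sunday"],
    },
    # IVQD: the question asks about content the image does not contain,
    # so the image and question are incompatible.
    "IVQD": {
        "image": "photo_of_a_cat.jpg",
        "question": "What color is the traffic light in the image?",
        "options": ["A. red", "B. yellow", "C. green", "D. blue"],
    },
}

for setting, item in upd_examples.items():
    print(setting, "-", item["question"], item["options"])
```

In all three cases, the desired behavior is the same: the model should decline to pick one of the listed options instead of producing a confident but unfounded answer.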
The authors create three benchmarks, MM-AAD Bench, MM-IASD Bench, and MM-IVQD Bench, based on the MMBench dataset, to systematically evaluate these UPD settings.
Experiments on five recent open-source VLMs and two closed-source VLMs reveal that most models struggle to withhold answers even when faced with unsolvable problems, highlighting significant room for improvement. The authors explore both training-free (prompt engineering) and training-based (instruction tuning) approaches to address UPD, but find that notable challenges remain, particularly for smaller VLMs and in the AAD setting.
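As an illustration of the training-free direction, a prompt-engineering sketch along these lines might append an explicit escape option and a withholding instruction to each multiple-choice question. The helper below (build_upd_prompt) and its wording are assumptions made for illustration; they are not the paper's actual prompts.

```python
def build_upd_prompt(question: str, options: list[str]) -> str:
    """Assemble a multiple-choice prompt with a simple training-free UPD mitigation.

    Illustrative sketch only: appends an explicit escape option ("None of the
    above") and an instruction telling the model it may withhold an answer.
    """
    option_lines = "\n".join(options)
    # Label the escape option with the next unused letter (e.g., "E").
    escape_option = f"{chr(ord('A') + len(options))}. None of the above"
    instruction = (
        "If none of the options is correct, or the question cannot be "
        "answered from the image, choose the 'None of the above' option."
    )
    return f"{question}\n{option_lines}\n{escape_option}\n{instruction}"


# Example usage with a hypothetical AAD-style item (correct answer absent).
print(build_upd_prompt(
    "What animal is shown in the image?",
    ["A. dog", "B. horse", "C. rabbit", "D. bird"],
))
```

The training-based alternative instead fine-tunes the model on instruction data that includes unanswerable items, so that withholding becomes a learned response rather than a prompt-level patch.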
The paper emphasizes the importance of developing more trustworthy and reliable VLMs that can accurately identify and refrain from answering unsolvable problems, which is crucial for the safe and practical deployment of these models.