The paper proposes the Dynamic Clue Bottleneck Model (DCLUB), an interpretable-by-design visual question answering (VQA) system. Unlike black-box VQA models that map directly from image and question to answer, DCLUB first produces a set of visual clues: natural-language statements of visually salient evidence from the image. It then uses a natural language inference (NLI) model to determine the final answer based solely on the generated clues.
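As a rough illustration of this generate-then-infer pipeline, here is a minimal sketch built from off-the-shelf Hugging Face models standing in for the paper's components. The model names, the dclub_answer function, and the candidate-answer interface are assumptions for illustration, not the authors' implementation; in particular, the paper's clue generator is question-conditioned and fine-tuned on its annotated clue data.

```python
from transformers import pipeline

# Hypothetical stand-ins: the paper trains its own question-conditioned
# clue generator; these off-the-shelf models only illustrate the
# two-stage bottleneck structure.
clue_generator = pipeline("image-to-text",
                          model="Salesforce/blip-image-captioning-base")
nli = pipeline("zero-shot-classification",
               model="facebook/bart-large-mnli")

def dclub_answer(image_path: str, question: str, candidates: list[str]) -> str:
    # Stage 1: produce natural-language visual clues from the image.
    clues = [out["generated_text"] for out in clue_generator(image_path)]
    premise = " ".join(clues)

    # Stage 2: pick the answer whose hypothesis is best entailed by the
    # clues alone. The image is never consulted here, which is what
    # makes the explanation faithful by construction.
    template = f"Question: {question} Answer: {{}}."
    scored = nli(premise, candidate_labels=candidates,
                 hypothesis_template=template)
    return scored["labels"][0]  # highest-entailment candidate

# Example usage (hypothetical inputs):
# answer = dclub_answer("kitchen.jpg", "Is the stove on?", ["yes", "no"])
```

Because the second stage consumes only the clue text, any answer the sketch produces is traceable to the stated clues, mirroring the faithfulness property described below.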
The key aspects of DCLUB are:
Interpretability: DCLUB is interpretable by design, with the visual clues serving as human-legible explanations of the model's reasoning process. This addresses the lack of transparency in black-box VQA models.
Faithfulness: DCLUB's predictions are based entirely on the generated visual clues, so the explanations are faithful to the final output by construction: the answer cannot depend on evidence the clues do not state.
Performance: Evaluations show that DCLUB performs comparably to black-box VQA models on benchmark datasets such as VQA-v2 and GQA, while outperforming them by 4.64% on a reasoning-focused test set.
The authors also collected a dataset of 1.7k VQA instances annotated with visual clues to train and evaluate DCLUB. Qualitative analysis shows that DCLUB succeeds when it generates correct visual clues, but can fail by missing fine-grained object attributes, misrecognizing an object's state, or overlooking small but important image regions.
Source: Xingyu Fu, Be... et al., arxiv.org, 04-16-2024, https://arxiv.org/pdf/2305.14882.pdf