The authors investigate the conceptual understanding capabilities of large vision-and-language (V+L) models by developing three novel benchmarking datasets: Probe-R, Probe-A, and Probe-B.
Probe-R evaluates the models' understanding of object relations by comparing an image to correct and incorrect prompts where the predicate is swapped. Probe-A examines the models' grasp of attribute-object relationships by comparing two images and two prompts, swapping either the attribute or the object. Probe-B probes the models' reliance on background context by removing the background and observing the change in performance.
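As a rough illustration of how such a probe can be scored, the sketch below compares a correct prompt against a predicate-swapped one for the same image using an off-the-shelf CLIP model from the `transformers` library. CLIP, the checkpoint name, and the prompt pair are stand-ins chosen for illustration; they are not the five models or the data used in the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical stand-in model for a generic image-text matching V+L model.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def relation_probe(image: Image.Image, correct: str, swapped: str) -> bool:
    """Return True if the model scores the correct prompt above the
    predicate-swapped (incorrect) prompt for the given image."""
    inputs = processor(text=[correct, swapped], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (1, 2): one score per prompt
    return bool(logits[0, 0] > logits[0, 1])

# Hypothetical prompt pair:
# relation_probe(img, "a man riding a horse", "a man feeding a horse")
```

Aggregating this pass/fail decision over many image-prompt pairs gives the kind of accuracy number such a probe reports.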
The authors experiment with five state-of-the-art V+L models and make several key observations:
For compositional understanding, they find that models struggle with compositionality, and that CNN-based backbones may be better at recognizing texture and patterns, while ViT-based backbones are better at recognizing color and shape.
For relational understanding, they observe that both modality-specific attention and co-attention in parallel improve relational understanding, and that predicate swapping that violates expectations surfaces the lack of an underlying conceptual model.
For contextual understanding, they find that models tend not to rely on background context when recognizing most objects, again indicating the lack of an underlying conceptual model.
The authors further leverage these insights to propose a simple finetuning approach based on selective negatives, which improves performance on their understanding-related probes at the cost of a slight drop in general performance.
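A minimal sketch of what finetuning with selective negatives might look like is given below, again assuming a CLIP-style similarity model and a simple margin loss over swapped-caption negatives. The model, the margin value, and the loss form are assumptions for illustration, not necessarily the authors' exact objective.

```python
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def selective_negative_loss(image, correct: str, negatives: list[str],
                            margin: float = 0.2):
    """Hinge loss pushing the correct caption above each selective negative
    (e.g. a predicate- or attribute-swapped caption) by a fixed margin."""
    inputs = processor(text=[correct] + negatives, images=image,
                       return_tensors="pt", padding=True)
    scores = model(**inputs).logits_per_image[0]  # one similarity score per caption
    pos, negs = scores[0], scores[1:]
    return F.relu(margin - pos + negs).mean()
```

The key design choice is that the negatives are not random captions but minimally edited ones targeting the specific failure modes (relations, attributes, context) surfaced by the probes.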