Core Concepts
This paper examines the compositionality of large generative vision-language models (GVLMs) and identifies a syntactical bias in current benchmarks that GVLMs can exploit through their linguistic capability alone. The authors propose a novel benchmark, SADE, to provide a more robust and unbiased evaluation of the visio-linguistic compositionality of GVLMs.
Summary
The paper examines the compositionality of large generative vision-language models (GVLMs) and identifies the syntactical bias in current benchmarks for evaluating multimodal compositionality. The authors make the following key observations:
The VisualGPTScore, a generative evaluation metric used to assess GVLMs, is more sensitive to the syntax and word order of reference sentences than contrastive metrics like BERTScore. This indicates that GVLMs tend to prioritize syntactical correctness over content relevance when differentiating positive and negative samples.
The authors find that current benchmarks, such as Winoground, VL-CheckList, ARO, and CREPE, exhibit a prevalent syntactical bias, which can be exploited by the linguistic capability of GVLMs. This bias renders the VisualGPTScore an insufficient metric for assessing the true multimodal compositionality of GVLMs.
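The two observations above can be illustrated with a minimal sketch. It assumes that a generative score is the length-normalised log-likelihood of a caption's tokens, which may differ from the exact VisualGPTScore formulation; the function names and all log-probability values are hypothetical and chosen only for illustration. The point is that if a text-only score already separates the positive caption from the negative one, a correct choice says little about whether the image was used.

```python
from statistics import fmean


def generative_score(token_logprobs):
    """Mean per-token log-likelihood of a caption (assumed score form)."""
    return fmean(token_logprobs)


def prefers_positive(pos_logprobs, neg_logprobs):
    """True if the positive caption scores higher than the negative one."""
    return generative_score(pos_logprobs) > generative_score(neg_logprobs)


# Hypothetical per-token log-probabilities for one benchmark sample.
# "with_image": scores conditioned on the image.
# "text_only": scores from the language prior alone (image withheld).
with_image = {"pos": [-0.4, -0.6, -0.5], "neg": [-0.9, -2.8, -1.7]}
text_only = {"pos": [-1.1, -1.3, -1.2], "neg": [-1.6, -3.5, -2.4]}

correct_with_image = prefers_positive(with_image["pos"], with_image["neg"])
correct_blind = prefers_positive(text_only["pos"], text_only["neg"])

# If the blind comparison is already correct, the sample can be solved
# from syntax alone and does not test multimodal compositionality.
solvable_without_image = correct_blind
```

In this toy sample both comparisons pick the positive caption, so the sample would count as a success for the model even though the image contributed nothing to the decision.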
To address this issue, the authors propose the following:
They introduce a SyntaxBias Score to quantify the syntactical discrepancy between positive and negative reference sentences in the existing benchmarks.
They create a challenging new task to evaluate the robustness of GVLMs against their inherent inclination toward syntactical correctness.
They leverage the SyntaxBias Score to filter and modify the existing benchmarks, resulting in a novel benchmark called the SyntActically DE-biased (SADE) benchmark.
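The filtering step can be sketched as follows. This summary does not give the exact definition of the SyntaxBias Score, so the sketch assumes it is the gap between text-only fluency scores of the positive and negative captions. The `CaptionPair` type, the `syntax_bias` and `debias` helpers, the fluency values, and the threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CaptionPair:
    positive: str
    negative: str
    pos_fluency: float  # text-only score of the positive caption (hypothetical)
    neg_fluency: float  # text-only score of the negative caption (hypothetical)


def syntax_bias(pair):
    """Assumed bias measure: fluency gap between positive and negative."""
    return pair.pos_fluency - pair.neg_fluency


def debias(pairs, threshold):
    """Keep only pairs whose captions are about equally fluent."""
    return [p for p in pairs if abs(syntax_bias(p)) <= threshold]


pairs = [
    # Both captions are grammatical, so the image is needed to choose.
    CaptionPair("a dog chasing a cat", "a cat chasing a dog", -1.2, -1.3),
    # The negative is word salad, so syntax alone gives the answer away.
    CaptionPair("the horse is eating the grass",
                "grass the eating is horse the", -1.0, -4.0),
]

kept = debias(pairs, threshold=0.5)
```

With these illustrative values only the first pair survives, because its negative caption cannot be rejected on fluency alone. Pairs that are filtered out could instead be rewritten so that the negative caption becomes grammatical, which corresponds to the "modify" part of the step above.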
The authors evaluate several state-of-the-art GVLMs on the SADE benchmark and provide insights into their compositional reasoning capabilities. The SADE benchmark aims to facilitate future research in this direction by providing a more robust and unbiased evaluation of the visio-linguistic compositionality of GVLMs.
Statistics
The paper does not contain specific metrics or figures that support its key arguments. The analysis rests on qualitative observations and on the newly introduced evaluation metrics and benchmarks.
Quotes
The paper does not contain any striking quotes that support its key arguments.