The paper examines the compositionality of large generative vision-language models (GVLMs) and identifies a syntactical bias in current benchmarks for evaluating multimodal compositionality. The authors make the following key observations:
The VisualGPTScore, a generative evaluation metric that scores a caption by its likelihood under the GVLM given the image, is far more sensitive to the syntax and word order of reference sentences than embedding-based text-similarity metrics such as BERTScore. This indicates that GVLMs prioritize syntactical correctness over content relevance when distinguishing positive from negative samples.
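To make the metric family concrete, here is a minimal sketch of a generative P(text | image) score in the spirit of VisualGPTScore, using BLIP-2 through Hugging Face transformers. The checkpoint choice and the `generative_score` name are illustrative assumptions, not the authors' implementation:

```python
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
model.eval()

@torch.no_grad()
def generative_score(image: Image.Image, caption: str) -> float:
    """Average per-token log P(caption | image) under the GVLM.

    Teacher-forcing the caption as labels yields the mean cross-entropy;
    its negation is the mean log-likelihood, so higher is better.
    """
    inputs = processor(images=image, text=caption, return_tensors="pt")
    out = model(**inputs, labels=inputs["input_ids"])
    return -out.loss.item()
```

Because such a score is a language-model likelihood conditioned on the image, an ungrammatical negative (e.g., a word-order shuffle) is penalized even when its content words match the image, which is exactly the sensitivity the authors report.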
The authors find that current benchmarks, such as Winoground, VL-CheckList, ARO, and CREPE, exhibit a prevalent syntactical bias that the linguistic capabilities of GVLMs can exploit: when negative captions are ungrammatical rewrites of the positives, a model can separate them on syntax alone, without any visual grounding. This bias renders VisualGPTScore an insufficient metric for assessing the true multimodal compositionality of GVLMs.
To address this issue, the authors propose the following:
They introduce a SyntaxBias Score to quantify the syntactical discrepancy between positive and negative reference sentences in existing benchmarks (see the sketch after this list).
They create a challenging new task to evaluate the robustness of GVLMs against their inherent inclination toward syntactical correctness.
They leverage the SyntaxBias Score to filter and modify the existing benchmarks, yielding a novel SyntActically DE-biased (SADE) benchmark.
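The paper's exact formulation of the SyntaxBias Score is not reproduced here; as one plausible instantiation, the syntactical discrepancy of a caption pair can be measured as the gap in *text-only* language-model likelihood, with GPT-2 standing in as an assumed scorer:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def text_loglik(sentence: str) -> float:
    """Mean per-token log-likelihood of the sentence, with no image involved."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    return -lm(ids, labels=ids).loss.item()

def syntax_gap(positive: str, negative: str) -> float:
    """Large positive values mean the negative can be rejected on
    fluency/syntax alone -- a syntactically biased benchmark pair."""
    return text_loglik(positive) - text_loglik(negative)

# ARO-style shuffled negatives score far below their positives:
# syntax_gap("a cat sitting on a mat", "mat a on sitting cat a")  # >> 0
```

Pairs with a large gap are solvable without the image, so filtering or rewriting them forces the benchmark to test visual grounding rather than grammaticality.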
The authors evaluate several state-of-the-art GVLMs on SADE and provide insights into their compositional reasoning capabilities. The benchmark aims to facilitate future research in this direction by providing a more robust and unbiased evaluation of the visio-linguistic compositionality of GVLMs.
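The evaluation protocol on such benchmarks is pair ranking: the model is credited when it assigns the positive caption a higher image-conditioned score than the negative. A sketch with a hypothetical `pair_accuracy` helper, reusing the `generative_score` function assumed above:

```python
def pair_accuracy(pairs, score_fn) -> float:
    """pairs: list of (image, positive_caption, negative_caption) triples."""
    correct = sum(
        score_fn(image, pos) > score_fn(image, neg)
        for image, pos, neg in pairs
    )
    return correct / len(pairs)

# accuracy = pair_accuracy(sade_pairs, generative_score)
```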