Evaluating the Compositional Reasoning Capabilities of Large Generative Vision-Language Models
This paper examines the compositionality of large generative vision-language models (GVLMs) and identifies a syntactical bias in current benchmarks, which GVLMs can exploit through their linguistic capability. The authors propose a novel benchmark, SADE, to provide a more robust and unbiased evaluation of the visio-linguistic compositionality of GVLMs.