The paper introduces the VISLA (Variance and Invariance to Semantic and Lexical Alterations) benchmark to evaluate the semantic and lexical understanding of language models. VISLA presents a 3-way semantic (in)equivalence task with a triplet of sentences associated with an image, to assess both vision-language models (VLMs) and unimodal language models (ULMs).
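The triplet protocol can be sketched in a few lines. The snippet below is a minimal illustration (not the paper's code): it scores a triplet by embedding the anchor, a semantically equivalent paraphrase, and a lexically similar but semantically different distractor, then checks whether the paraphrase is closer. A bag-of-words embedding stands in for a real sentence encoder, and the example sentences are invented; notably, this lexical stand-in fails the check, which mirrors exactly the lexical oversensitivity the benchmark is designed to expose.

```python
from collections import Counter
from math import sqrt

def bow_embed(sentence):
    # Stand-in embedding: bag-of-words counts. A real evaluation would
    # use the language model's sentence encoder here instead.
    return Counter(sentence.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def triplet_correct(anchor, positive, negative, embed=bow_embed):
    # A model passes a triplet when the semantically equivalent paraphrase
    # scores higher than the lexically similar distractor.
    ea, ep, en = embed(anchor), embed(positive), embed(negative)
    return cosine(ea, ep) > cosine(ea, en)

# Invented triplet for illustration: the negative shares almost every word
# with the anchor but flips the spatial relation.
anchor   = "a cat sitting on the left of a dog"
positive = "a dog with a cat seated to its left"
negative = "a cat sitting on the right of a dog"

# prints False: word overlap pulls the distractor closer than the paraphrase,
# the failure mode the benchmark probes in learned encoders.
print(triplet_correct(anchor, positive, negative))
```

Swapping `bow_embed` for a learned sentence encoder turns this into the off-the-shelf (no fine-tuning) evaluation the paper describes.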
The evaluation of 34 VLMs and 20 ULMs reveals that both model families have surprising difficulty distinguishing lexical from semantic variation. Spatial semantics encoded by language models also appear highly sensitive to lexical information. Notably, the text encoders of VLMs show greater sensitivity to semantic and lexical variations than unimodal text encoders.
The paper's contributions include the unification of image-to-text and text-to-text retrieval tasks, an off-the-shelf evaluation without fine-tuning, and assessing LMs' semantic (in)variance in the presence of lexical alterations. The results highlight strengths and weaknesses across diverse vision and unimodal language models, contributing to a deeper understanding of their capabilities.
Key insights distilled from the paper by Sri Harsha D... at arxiv.org, 04-26-2024: https://arxiv.org/pdf/2404.16365.pdf