Miranda, I., Salaberria, A., Agirre, E., & Azkune, G. (2024). BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval. Advances in Neural Information Processing Systems, 37. https://arxiv.org/pdf/2406.09952.pdf
This paper introduces BIVLC, a new benchmark dataset designed to evaluate the bidirectional vision-language compositionality (VLC) of multimodal models, addressing a limitation of existing datasets, which focus primarily on image-to-text retrieval.
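To make the bidirectional evaluation concrete, below is a minimal sketch of scoring one BIVLC-style instance (two images paired with two captions) in both retrieval directions. It assumes the public openai/clip-vit-base-patch32 checkpoint from Hugging Face transformers as a stand-in for the evaluated models; the file names, captions, and the per-instance scoring rule are illustrative assumptions, not the paper's exact protocol.

```python
# Sketch: bidirectional retrieval scoring on one two-image/two-caption instance.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score_instance(images, captions):
    """images[i] is the match for captions[i]; returns (I2T, T2I, group) correctness."""
    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    sims = out.logits_per_image                      # shape (2, 2): image x text
    i2t = (sims.argmax(dim=1) == torch.arange(2)).all().item()  # each image picks its caption
    t2i = (sims.argmax(dim=0) == torch.arange(2)).all().item()  # each caption picks its image
    return i2t, t2i, i2t and t2i                     # group score: both directions correct

# Hypothetical positive/hard-negative pair for illustration.
imgs = [Image.open("positive.jpg"), Image.open("hard_negative.jpg")]
caps = ["a dog chasing a cat", "a cat chasing a dog"]
print(score_instance(imgs, caps))
```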
The authors extend the existing SUGARCREPE dataset by generating synthetic hard negative images for each hard negative text with a text-to-image generator (SDXL-DPO). Human annotators then filter the generated images, selecting the best match for each negative caption and discarding ambiguous instances. The resulting BIVLC dataset contains a balanced set of image-to-text (I2T) and text-to-image (T2I) retrieval examples.

The authors evaluate several state-of-the-art multimodal models on BIVLC, including contrastive models such as CLIP and generative approaches such as Open CapPa and VQAScore. They also propose two new training strategies for contrastive models: TROHN-TEXT, which uses hard negative texts generated by LLMs, and TROHN-IMG, which additionally incorporates hard negative images generated with SDXL-DPO.
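As a hedged illustration of the image-generation step described above, the sketch below renders a hard negative caption with an SDXL pipeline from diffusers, producing several candidates per caption for subsequent human filtering. The base SDXL checkpoint id is used here because the exact SDXL-DPO weights are not named in this summary; the caption and output file names are hypothetical.

```python
# Sketch: generate candidate hard negative images for one negative caption.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # swap in an SDXL-DPO checkpoint if available
    torch_dtype=torch.float16,
).to("cuda")

negative_caption = "a cat chasing a dog"  # hypothetical LLM-generated hard negative text
candidates = pipe(
    prompt=negative_caption,
    num_images_per_prompt=4,    # several candidates per caption for annotator filtering
    num_inference_steps=30,
).images
for i, img in enumerate(candidates):
    img.save(f"candidate_{i}.png")  # annotators keep the best match or discard all
```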
The introduction of BIVLC provides a more comprehensive and robust benchmark for evaluating the bidirectional VLC capabilities of multimodal models. The findings highlight the need for further research in developing models that can effectively handle both I2T and T2I retrieval tasks. The proposed training strategies, particularly the use of hard negative images, offer promising avenues for improving model performance on bidirectional VLC.
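As one plausible reading of the hard-negative-image training strategy (a sketch, not the paper's exact TROHN-IMG recipe), a CLIP-style contrastive loss can append each pair's generated hard negative image to the candidate set for the text-to-image direction:

```python
# Sketch: InfoNCE loss where each (image, text) pair carries a hard negative image,
# so the text-to-image direction must rank the positive above in-batch AND hard negatives.
import torch
import torch.nn.functional as F

def hard_negative_clip_loss(img_emb, txt_emb, neg_img_emb, temperature=0.07):
    """img_emb, txt_emb, neg_img_emb: (B, D) L2-normalized embeddings."""
    all_imgs = torch.cat([img_emb, neg_img_emb], dim=0)   # (2B, D) candidate images
    logits_t2i = txt_emb @ all_imgs.t() / temperature     # (B, 2B)
    logits_i2t = img_emb @ txt_emb.t() / temperature      # (B, B)
    labels = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_t2i = F.cross_entropy(logits_t2i, labels)        # correct image at index i of 2B
    loss_i2t = F.cross_entropy(logits_i2t, labels)        # standard in-batch term
    return (loss_t2i + loss_i2t) / 2

# Random embeddings standing in for CLIP encoder outputs.
B, D = 8, 512
img = F.normalize(torch.randn(B, D), dim=-1)
txt = F.normalize(torch.randn(B, D), dim=-1)
neg = F.normalize(torch.randn(B, D), dim=-1)
print(hard_negative_clip_loss(img, txt, neg).item())
```

Adding hard negatives only on the image side is a design choice in this sketch; a symmetric variant that also appends hard negative texts (as in TROHN-TEXT) follows the same pattern.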
This research contributes to vision-language understanding by introducing a more challenging benchmark for bidirectional VLC, and its findings and training strategies carry direct implications for building more robust and versatile multimodal models.
The study is limited by its reliance on synthetically generated hard negative images, which may not fully capture the complexity of real-world images, and on manual filtering of the generated candidates. Future research could explore automatic filtering of hard negative images to improve the quality and scalability of training data. Additionally, investigating why model performance differs so sharply between I2T and T2I retrieval could lead to more balanced and effective multimodal models.