This survey reviews key advances in Vision-Language Models (VLMs), categorizing them into three groups based on their input processing and output generation capabilities: Vision-Language Understanding Models, Multimodal Input Text Generation Models, and Multimodal Input-Multimodal Output Models. It analyzes the foundational architectures, training data sources, strengths, and limitations of a broad range of VLMs, offering readers a nuanced understanding of the field, and highlights promising avenues for future research.
Leveraging text-to-image diffusion models, we generate a large-scale dataset of synthetic counterfactual image-text pairs to probe and mitigate intersectional social biases in state-of-the-art vision-language models.
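A minimal sketch of how such counterfactual image-text pairs could be generated, assuming Stable Diffusion via the Hugging Face diffusers library; the attribute lists and prompt template below are illustrative placeholders, not the paper's actual dataset construction.

```python
from itertools import product
import torch
from diffusers import StableDiffusionPipeline

# Hypothetical intersectional attribute axes; the real dataset may differ.
GENDERS = ["woman", "man"]
ETHNICITIES = ["Asian", "Black", "White"]
OCCUPATIONS = ["doctor", "firefighter", "teacher"]

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

pairs = []
for gender, ethnicity, job in product(GENDERS, ETHNICITIES, OCCUPATIONS):
    # Counterfactuals share the occupation but vary the protected attributes.
    prompt = f"a photo of a {ethnicity} {gender} working as a {job}"
    image = pipe(prompt, num_inference_steps=30).images[0]
    pairs.append((image, prompt))
```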
VLMs can improve their semantic grounding performance by receiving and generating feedback, without requiring in-domain data, fine-tuning, or modifications to the network architectures.
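A schematic sketch of this kind of feedback loop; `vlm_answer` and `vlm_critique` are hypothetical wrappers around a single off-the-shelf VLM, not an interface from the paper.

```python
def iterative_grounding(image, query, vlm_answer, vlm_critique, max_rounds=3):
    """Let a VLM refine its grounding answer using self-generated feedback.

    vlm_answer(image, prompt) -> str    # proposes a grounded answer
    vlm_critique(image, prompt) -> str  # judges the answer, e.g. "correct" or a hint
    Both are hypothetical callables; no fine-tuning or architecture change is assumed.
    """
    answer = vlm_answer(image, query)
    for _ in range(max_rounds):
        feedback = vlm_critique(
            image,
            f"Question: {query}\nProposed answer: {answer}\n"
            "Is this answer grounded in the image? If not, explain why.",
        )
        if "correct" in feedback.lower():
            break
        # Feed the critique back so the model can revise its answer.
        answer = vlm_answer(
            image,
            f"Question: {query}\nPrevious answer: {answer}\n"
            f"Feedback: {feedback}\nRevised answer:",
        )
    return answer
```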
This work leverages complementary sources of information, namely category descriptions generated by large language models (LLMs) and abundant, fine-grained image classification datasets, to improve the zero-shot classification performance of vision-language models (VLMs) across fine-grained domains.
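A minimal sketch of zero-shot classification with LLM-generated class descriptions, assuming CLIP via the Hugging Face transformers library; the class names and descriptions are illustrative, not the paper's prompts.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical LLM-generated descriptions for each fine-grained class.
class_descriptions = {
    "painted bunting": ["a small bird with a blue head and a red chest",
                        "a songbird with green wings"],
    "indigo bunting":  ["a small, uniformly deep-blue songbird",
                        "a bird with a short conical bill"],
}

@torch.no_grad()
def classify(image):
    # Embed each class as the average of its description embeddings.
    class_embeds = []
    for descs in class_descriptions.values():
        inputs = processor(text=descs, return_tensors="pt", padding=True)
        feats = model.get_text_features(**inputs)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        class_embeds.append(feats.mean(dim=0))
    text_embeds = torch.stack(class_embeds)

    img_inputs = processor(images=image, return_tensors="pt")
    img_feat = model.get_image_features(**img_inputs)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)

    scores = img_feat @ text_embeds.T  # cosine similarity per class
    return list(class_descriptions)[scores.argmax().item()]
```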
LVLM-Interpret is a novel interactive application designed to enhance the interpretability of large vision-language models by providing insights into their internal mechanisms, including image patch importance, attention patterns, and causal relationships.
Addressing underspecification in visual question inputs can improve zero-shot performance of large vision-language models by incorporating relevant visual details and commonsense reasoning.
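A rough sketch of one way to fill in an underspecified question before answering it; `caption_fn` and `llm_rewrite` are hypothetical wrappers around an off-the-shelf captioner and LLM, and the prompt wording is illustrative rather than the paper's method.

```python
def rephrase_underspecified(image, question, caption_fn, llm_rewrite):
    """Enrich a vague visual question with image-grounded details.

    caption_fn(image) -> str and llm_rewrite(prompt) -> str are hypothetical
    callables; the rewritten question is then passed to the LVLM as usual.
    """
    details = caption_fn(image)
    prompt = (
        f"Question: {question}\n"
        f"Visual context: {details}\n"
        "Rewrite the question so it is fully specified, adding any commonsense "
        "assumptions needed to answer it."
    )
    return llm_rewrite(prompt)
```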
This paper examines the compositionality of large generative vision-language models (GVLMs) and identifies a syntactic bias in current benchmarks that can be exploited by the linguistic capability of GVLMs. The authors propose a novel benchmark, SADE, to provide a more robust and unbiased evaluation of the visio-linguistic compositionality of GVLMs.
Open-vocabulary vision-language models like CLIP struggle to interpret compound nouns as effectively as they understand individual nouns, particularly when one noun acts as an attribute to the other.
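A simple probe of this behavior, assuming CLIP via the Hugging Face transformers library; the compound noun "bird house" and the prompts are illustrative examples, not the paper's benchmark.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# "bird house" is the compound; "bird" and "house" are its constituents.
texts = ["a photo of a bird house", "a photo of a bird", "a photo of a house"]

@torch.no_grad()
def compound_noun_scores(image):
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image.squeeze(0)  # one score per text
    return dict(zip(texts, logits.tolist()))

# If an image of a bird house scores nearly as high for "a photo of a bird"
# as for the compound prompt, CLIP is likely ignoring the attributive relation.
```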
We decompose CLIP's image representation into text-interpretable components attributed to individual attention heads and image locations, revealing specialized roles for many heads and emergent spatial localization.
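A toy illustration of the decomposition with random tensors (shapes are placeholders, not CLIP's actual dimensions): because of the residual stream, the final representation is approximately a sum of per-layer, per-head, per-token contributions plus MLP terms, and each contribution can then be interpreted individually.

```python
import torch

# layers, heads, image tokens (incl. [CLS]), hidden width -- illustrative only
L, H, N, D = 12, 12, 50, 768

# c[l, h, i]: what head h at layer l writes to the residual stream for token i
contributions = torch.randn(L, H, N, D)
mlp_terms = torch.randn(L, D)

# The final [CLS] representation is (approximately) the sum of all contributions.
cls_representation = contributions.sum(dim=(0, 1, 2)) + mlp_terms.sum(dim=0)

# Each c[l, h, i] can be projected into CLIP's joint space and compared with
# text embeddings to label what that head and image location contribute.
```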
Large vision-language models (LVLMs) have recently achieved rapid progress, but current evaluation methods have two primary issues: 1) many evaluation samples do not require visual understanding, as the answers can be inferred directly from the questions and options or from the world knowledge embedded in language models; 2) unintentional data leakage in the training of LLMs and LVLMs allows them to answer some questions that require visual input without ever accessing the images.
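A sketch of how the first issue could be quantified by scoring a model with and without the images; `model_answer` is a hypothetical wrapper, not an evaluation protocol from the paper.

```python
def visual_necessity_gap(model_answer, benchmark):
    """Estimate how many benchmark items an LVLM solves without seeing the image.

    model_answer(question, options, image=None) -> str is a hypothetical wrapper;
    passing image=None queries the model text-only. Each benchmark item is a dict
    with "question", "options", "image", and "answer" keys.
    """
    blind_correct = full_correct = 0
    for item in benchmark:
        blind = model_answer(item["question"], item["options"], image=None)
        full = model_answer(item["question"], item["options"], image=item["image"])
        blind_correct += blind == item["answer"]
        full_correct += full == item["answer"]
    n = len(benchmark)
    # A small gap suggests the benchmark does not actually require vision.
    return {"text_only_acc": blind_correct / n, "with_image_acc": full_correct / n}
```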