Core Concepts
Existing Vision Language Models (VLMs) often struggle to align image and text effectively because their training loss weights all text tokens equally, regardless of how strongly each token depends on the image. This paper introduces Contrastive Alignment (CAL), a method that prioritizes visually correlated tokens during training, yielding consistent improvements in VLM performance across benchmarks.
Xiao, X., Wu, B., Wang, J., Li, C., Zhou, X., & Guo, H. (2024). Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment. Advances in Neural Information Processing Systems, 37.
The paper's starting observation is that standard image-text alignment in VLMs treats every text token as equally informative, even though many tokens (function words, generic phrasing) carry little visual content. CAL addresses this by scoring each token's visual correlation and re-weighting the training objective accordingly, strengthening the alignment between the visual and textual modalities without changing the model architecture.
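The re-weighting idea can be sketched in code. The sketch below is illustrative, not the paper's implementation: it assumes each token's visual correlation is scored by how much the image raises that token's log-probability (prediction with image vs. without image), and the function names and normalization scheme are my own assumptions.

```python
def token_weights(logp_with_image, logp_without_image):
    """Weight each caption token by how much the image raises its
    log-probability. Tokens whose likelihood barely changes when the
    image is removed are weakly visually correlated and get down-weighted.
    (Illustrative sketch; not the authors' exact formulation.)"""
    deltas = [max(w - wo, 0.0)
              for w, wo in zip(logp_with_image, logp_without_image)]
    total = sum(deltas) or 1.0  # avoid division by zero
    return [d / total for d in deltas]

def weighted_nll(logp_with_image, logp_without_image):
    # Token-reweighted negative log-likelihood over the caption.
    weights = token_weights(logp_with_image, logp_without_image)
    return -sum(w * lp for w, lp in zip(weights, logp_with_image))

# Example: the second token (say, a color word) depends on the image,
# while the first (say, "the") is predictable without it.
with_img = [-0.1, -0.5]
without_img = [-0.1, -3.0]
print(token_weights(with_img, without_img))  # → [0.0, 1.0]
```

In this toy example, the image-independent token receives zero weight, so the loss concentrates on the token that actually requires seeing the image.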