Investigating the Modality Gap and Object Bias in Contrastive Vision-Language Representation Learning
The modality gap, a separation of image and text embeddings in the shared representation space, and object bias, a preference for objects over other factors such as attributes, are two key challenges in contrastive vision-language representation learning. The driving factor behind both phenomena is the information imbalance between images and their captions.
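As an illustration of what a separation of image and text embeddings can mean in practice, the following minimal sketch quantifies the gap as the Euclidean distance between the centroids of L2-normalized image and text embeddings. This centroid-distance measure is one common choice and is an assumption here, not necessarily the metric used in this work. The function names (`l2_normalize`, `centroid`, `modality_gap`) and the synthetic data are hypothetical stand-ins for the outputs of a trained model.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def modality_gap(image_embeddings, text_embeddings):
    """Distance between the centroids of the normalized embeddings.

    Assumption: the gap is summarized by centroid distance; other
    measures (e.g. linear separability) are equally plausible.
    """
    img_center = centroid([l2_normalize(v) for v in image_embeddings])
    txt_center = centroid([l2_normalize(v) for v in text_embeddings])
    return math.dist(img_center, txt_center)


# Synthetic stand-in data (hypothetical): two clusters placed on
# opposite sides of the first axis, mimicking separated modalities.
rng = random.Random(0)
dim = 8
offset = [1.0] + [0.0] * (dim - 1)
images = [[rng.gauss(0, 0.1) + o for o in offset] for _ in range(100)]
texts = [[rng.gauss(0, 0.1) - o for o in offset] for _ in range(100)]

gap = modality_gap(images, texts)    # large: the clusters are far apart
same = modality_gap(images, images)  # zero: identical sets share a centroid
print(f"gap between modalities: {gap:.3f}, within one modality: {same:.3f}")
```

Because both centroids lie inside the unit ball, this measure is bounded between 0 and 2, which makes values comparable across embedding dimensions.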