Two key challenges in contrastive vision-language representation learning are the modality gap, i.e., the separation of image and text embeddings in the shared representation space, and the bias towards objects over other factors such as attributes. The driving factor behind both phenomena is the information imbalance between images and their captions.
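A common way to quantify the modality gap is the Euclidean distance between the centroids of the normalized image and text embeddings. The sketch below illustrates this measurement; the function name and the random features standing in for encoder outputs are our own illustrative choices, not part of the paper.

```python
import numpy as np

def modality_gap(image_embs: np.ndarray, text_embs: np.ndarray) -> float:
    """Distance between centroids of L2-normalized image and text
    embeddings; a simple proxy for the modality gap."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))

# Toy usage: random features in place of real encoder outputs.
rng = np.random.default_rng(0)
gap = modality_gap(rng.normal(size=(512, 256)), rng.normal(size=(512, 256)))
print(f"modality gap: {gap:.4f}")
```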
This paper proposes PCME++, an improved probabilistic cross-modal embedding model that captures the inherent ambiguity in image-text matching datasets caused by many-to-many correspondences (multiplicity) and abundant false negatives. PCME++ introduces a new closed-form probabilistic distance together with optimization techniques based on pseudo-positives and mixed sample data augmentation, and demonstrates superior performance on several benchmarks.
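To illustrate the kind of distance involved, when each embedding is a diagonal Gaussian $Z \sim \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$, the expected squared Euclidean distance between two independent embeddings has the closed form $\mathbb{E}\|Z_v - Z_t\|^2 = \|\mu_v - \mu_t\|^2 + \sum_i \sigma_{v,i}^2 + \sum_i \sigma_{t,i}^2$, which can be computed without sampling. The sketch below implements this standard closed form as an illustration; it is not guaranteed to match the paper's exact formulation, and the function name is hypothetical.

```python
import numpy as np

def closed_form_distance(mu_v, sigma2_v, mu_t, sigma2_t):
    """Expected squared Euclidean distance between independent diagonal
    Gaussians N(mu_v, diag(sigma2_v)) and N(mu_t, diag(sigma2_t)):
    E||Z_v - Z_t||^2 = ||mu_v - mu_t||^2 + sum(sigma2_v) + sum(sigma2_t).
    No Monte Carlo sampling is needed."""
    return float(np.sum((mu_v - mu_t) ** 2)
                 + np.sum(sigma2_v) + np.sum(sigma2_t))

# Toy usage with 4-dimensional probabilistic embeddings.
mu_v, s2_v = np.array([1.0, 0.0, 0.5, -0.5]), np.full(4, 0.1)
mu_t, s2_t = np.array([0.8, 0.2, 0.4, -0.6]), np.full(4, 0.2)
print(closed_form_distance(mu_v, s2_v, mu_t, s2_t))
```

The variance terms penalize uncertain embeddings, which is what lets such a distance reflect the ambiguity of one-to-many image-text matches.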