Core Concepts
Learning a system of CLIP data experts via clustering mitigates the noise from false-negative samples in web-crawled image-caption pairs and enhances contrastive learning.
Summary
The paper presents the Mixture of Data Experts (MoDE) framework for contrastive language-image pretraining (CLIP). The key insights are:
The success of CLIP relies on the supervision provided by the pairing between images and captions, which tends to be noisy in web-crawled data. Captions may describe only limited visual content or even be unrelated to their images, producing false-negative samples that degrade the quality of contrastive learning.
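To make the false-negative problem concrete, here is a minimal sketch of the standard symmetric CLIP-style contrastive (InfoNCE) loss, not MoDE-specific: every off-diagonal image-text pair in the batch is treated as a negative, so two web-crawled pairs with near-identical captions wrongly repel each other.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of image/text embeddings.

    Every off-diagonal pair is treated as a negative, so two pairs
    with semantically matching captions become false negatives.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))           # diagonal = true pairs
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# toy batch: 4 pairs with 512-dimensional embeddings
loss = clip_contrastive_loss(torch.randn(4, 512), torch.randn(4, 512))
```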
MoDE addresses this issue by learning a system of CLIP data experts via clustering. The training image-caption pairs are first clustered into several disjoint subsets based on their captions. Each cluster is then used to train a specialized data expert model, which is less sensitive to noise from other clusters and learns more effectively from semantically similar data, as sketched below.
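The following sketch illustrates the partitioning step with a single-level k-means over caption embeddings; the paper's actual clustering scheme may be finer-grained, and the `pairs` list here is just a stand-in for the aligned (image, caption) training pairs.

```python
import numpy as np
from sklearn.cluster import KMeans

def partition_by_caption(caption_embs, pairs, n_experts=4):
    """Cluster caption embeddings into disjoint subsets, one per data expert.

    `caption_embs` is an (N, D) array of caption embeddings aligned with
    `pairs`, the list of N (image, caption) training pairs.
    """
    kmeans = KMeans(n_clusters=n_experts, n_init=10).fit(caption_embs)
    subsets = [[] for _ in range(n_experts)]
    for pair, label in zip(pairs, kmeans.labels_):
        subsets[label].append(pair)
    # Each subset trains its own CLIP expert with the standard objective;
    # the runs are independent, so they can proceed asynchronously.
    return subsets, kmeans.cluster_centers_

# toy example: 100 caption embeddings of dimension 64
embs = np.random.randn(100, 64)
pairs = list(range(100))  # stand-ins for (image, caption) pairs
subsets, centroids = partition_by_caption(embs, pairs)
```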
At inference time, the task metadata (e.g., class names) is compared to the centroid of each data cluster to decide which data experts to activate. The outputs of the selected data experts are then ensembled to make the final prediction (see the sketch below).
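A minimal sketch of this routing-and-ensembling step follows. The softmax weighting over the top-k closest experts and the `route_and_ensemble` helper are illustrative assumptions, not necessarily the paper's exact routing rule; each expert is assumed to return per-class logits for an image.

```python
import numpy as np

def route_and_ensemble(experts, centroids, class_name_embs, image, top_k=2):
    """Weight each expert by the similarity of the task metadata
    (class-name embeddings) to its cluster centroid, then ensemble
    the activated experts' per-class logits.
    """
    sims = class_name_embs @ centroids.T        # (n_classes, n_experts)
    scores = sims.mean(axis=0)                  # task-level affinity per expert
    top = np.argsort(scores)[-top_k:]           # activate the closest experts
    weights = np.exp(scores[top]) / np.exp(scores[top]).sum()
    logits = sum(w * experts[i](image) for w, i in zip(weights, top))
    return int(np.argmax(logits))               # final class prediction

# toy example: 3 experts returning random logits over 10 classes
rng = np.random.default_rng(0)
experts = [lambda img, r=rng: r.normal(size=10) for _ in range(3)]
pred = route_and_ensemble(experts, rng.normal(size=(3, 64)),
                          rng.normal(size=(10, 64)), image=None)
```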
Experiments show that MoDE outperforms several state-of-the-art vision-language models on standard benchmarks, including zero-shot image classification, image-to-text retrieval, and text-to-image retrieval. Its advantage stems from better-trained individual data expert models, which benefit from fewer false-negative samples and more hard negatives within each cluster.
MoDE is also well suited to large-scale training: each data expert uses only a fraction of the whole dataset, so the experts can be trained asynchronously with fewer compute resources.
Statistics
Four ViT-B/16 data experts in MoDE can outperform the single ViT-L/14 models from OpenAI CLIP and OpenCLIP on zero-shot image classification, at less than 35% of the training cost.
Quotes
"The key to the success of contrastive vision-language representation learning lies in the creation of quality negative examples for training."
"MoDE separates false negative samples into different clusters and groups the pairs with similar semantics, which mitigates noise from false-negative captions while incorporating a more challenging set of hard-negative examples, thereby enhancing vision-language pre-training."