MOFI presents a novel approach to learning image representations from noisy, entity-annotated images. The work focuses on two aspects, pre-training data and training recipes, introducing the Image-to-Entities (I2E) dataset of 1 billion images covering 2 million distinct entities. The training recipes explored include supervised pre-training on entity labels, contrastive image-text pre-training, and multi-task learning that combines the two. MOFI achieves significant gains on image retrieval benchmarks such as GPR1200, surpassing the state-of-the-art CLIP model and demonstrating the effectiveness of the I2E dataset for learning strong image representations.
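To make the multi-task recipe concrete, the sketch below pairs a supervised entity-classification loss with a CLIP-style symmetric contrastive loss on top of shared image embeddings. This is a minimal illustration assuming PyTorch; the class name `MultiTaskHead`, the dimensions, and the equal loss weighting are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F
from torch import nn


class MultiTaskHead(nn.Module):
    """Hypothetical sketch of a joint entity-classification + contrastive
    objective over a shared image embedding, in the spirit of MOFI's
    multi-task recipe. Names and sizes are illustrative assumptions."""

    def __init__(self, embed_dim: int, num_entities: int):
        super().__init__()
        # Linear classifier over the entity vocabulary (supervised task).
        self.entity_classifier = nn.Linear(embed_dim, num_entities)
        # Learnable temperature for the contrastive task, as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07)

    def forward(self, image_emb, text_emb, entity_labels):
        # Supervised objective: predict each image's entity label.
        cls_logits = self.entity_classifier(image_emb)
        cls_loss = F.cross_entropy(cls_logits, entity_labels)

        # Contrastive objective: align image and text embeddings.
        img = F.normalize(image_emb, dim=-1)
        txt = F.normalize(text_emb, dim=-1)
        logits = self.logit_scale.exp() * img @ txt.t()
        targets = torch.arange(img.size(0), device=img.device)
        # Symmetric InfoNCE over image->text and text->image directions.
        con_loss = (F.cross_entropy(logits, targets)
                    + F.cross_entropy(logits.t(), targets)) / 2

        # Multi-task training sums the two losses (weighting is a choice).
        return cls_loss + con_loss


# Usage with random tensors standing in for encoder outputs.
# I2E covers ~2M entities; a small vocabulary is used here for the demo.
head = MultiTaskHead(embed_dim=512, num_entities=1000)
image_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
labels = torch.randint(0, 1000, (8,))
loss = head(image_emb, text_emb, labels)
loss.backward()
```

Summing the two losses lets one encoder benefit from both signals: the entity labels supply fine-grained supervision, while the contrastive term preserves the image-text alignment that enables zero-shot transfer.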
Key insights distilled from: Wentao Wu, Al... at arxiv.org, 03-19-2024
https://arxiv.org/pdf/2306.07952.pdf