MOFI: Learning Image Representations from Noisy Entity Annotated Images (ICLR 2024)
Core Concepts
MOFI is a vision foundation model designed to learn image representations from noisy entity annotated images, achieving state-of-the-art performance on image retrieval benchmarks such as GPR1200.
Abstract:
Introduces MOFI, a new vision foundation model.
Utilizes noisy entity annotated images for learning image representations.
Achieves 86.66% mAP on the GPR1200 dataset, surpassing CLIP's performance.
Introduction:
Focuses on acquiring high-quality image representations.
Discusses challenges in scaling datasets for supervised image classification.
Data Extraction Method:
Introduces the Image-to-Entities (I2E) dataset, containing 1 billion images and 2 million distinct entities.
Studies different training recipes, such as supervised pre-training (treating entities as classification labels) and contrastive pre-training (treating entity names as free-form text).
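The two training recipes above can be combined into a single multi-task objective: a CLIP-style contrastive loss between image and entity-name embeddings, plus a cross-entropy loss that treats each entity as a classification label. The sketch below is illustrative only, not MOFI's actual implementation; the function name, the `alpha` weighting, and the `temp` temperature are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_task_loss(img_emb, txt_emb, class_logits, entity_ids,
                    alpha=0.5, temp=0.07):
    """Toy multi-task objective (illustrative sketch, not MOFI's code):
    contrastive image-text loss + supervised entity classification loss."""
    # L2-normalize image and entity-name embeddings
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    # (B, B) similarity matrix; matched image-text pairs lie on the diagonal
    logits = img @ txt.T / temp
    targets = np.arange(len(img))
    probs = softmax(logits, axis=1)
    contrastive = -np.log(probs[targets, targets]).mean()
    # supervised head: cross-entropy over the entity vocabulary
    cls_probs = softmax(class_logits, axis=1)
    supervised = -np.log(cls_probs[np.arange(len(entity_ids)), entity_ids]).mean()
    # alpha balances the two objectives (assumed hyperparameter)
    return alpha * supervised + (1 - alpha) * contrastive
```

In practice both losses would share the same image encoder, so the model learns fine-grained entity discrimination (supervised head) while keeping the aligned image-text space that enables zero-shot transfer (contrastive head).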
Results and Experiments:
MOFI outperforms CLIP on various tasks, such as image retrieval and zero-shot classification.
Demonstrates strong performance on benchmarks like ImageNet and VTAB.
Conclusion:
Highlights the effectiveness of MOFI in learning robust image representations from noisy data.
Emphasizes the importance of combining supervised and contrastive pre-training approaches.
MOFI
Stats
MOFI achieves 86.66% mAP on the GPR1200 dataset.
The I2E dataset consists of 1 billion images and 2 million distinct entities.
Quotes
"Through this method, we have created Image-to-Entities (I2E), a new dataset with 1 billion images and 2 million distinct entities."
"The final MOFI model achieves 86.66% mAP on the challenging GPR1200 dataset."