MOFI presents a novel approach to learning image representations from noisy entity-annotated images. Rather than proposing a new architecture, the work focuses on pre-training data and training recipes, introducing the Image-to-Entities (I2E) dataset, which contains 1 billion images covering 2 million distinct entities. By combining supervised pre-training, contrastive pre-training, and multi-task learning, MOFI achieves significant performance improvements on image-retrieval benchmarks such as GPR1200, surpassing the state-of-the-art CLIP model on several benchmarks and demonstrating the effectiveness of the I2E dataset for learning strong image representations.
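As a rough illustration of the multi-task training described above, the sketch below combines a supervised entity-classification loss with a CLIP-style contrastive loss. This is a minimal sketch under stated assumptions, not the paper's actual implementation: the encoders, the classifier head, and the weighting parameters (alpha, temperature) are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def multitask_loss(image_encoder, text_encoder, classifier,
                   images, entity_ids, entity_texts,
                   alpha=0.5, temperature=0.07):
    """Hypothetical sketch of a MOFI-style multi-task objective:
    supervised entity classification + contrastive image-text matching.
    All module and parameter names are assumptions, not from the paper."""
    # Embed images once and reuse the embedding for both objectives.
    img_emb = F.normalize(image_encoder(images), dim=-1)       # (B, D)
    txt_emb = F.normalize(text_encoder(entity_texts), dim=-1)  # (B, D)

    # Supervised objective: predict the entity label from the image embedding.
    logits_cls = classifier(img_emb)                           # (B, num_entities)
    loss_cls = F.cross_entropy(logits_cls, entity_ids)

    # Contrastive objective: match each image to its entity text within the batch,
    # using in-batch negatives as in CLIP.
    logits_con = img_emb @ txt_emb.t() / temperature           # (B, B)
    targets = torch.arange(images.size(0), device=images.device)
    loss_con = 0.5 * (F.cross_entropy(logits_con, targets) +
                      F.cross_entropy(logits_con.t(), targets))

    # Weighted sum of the two tasks; the weighting scheme here is assumed.
    return alpha * loss_cls + (1 - alpha) * loss_con
```

The key design point the sketch tries to convey is that both objectives share one image embedding, so the classification signal from the entity labels and the contrastive signal from the paired text jointly shape the same representation.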
Source: Wentao Wu et al., arxiv.org, 03-19-2024
https://arxiv.org/pdf/2306.07952.pdf