The paper introduces MagicLens, a series of self-supervised image retrieval models that support open-ended instructions. The key insight is that image pairs that naturally occur on the same web pages contain a wide range of implicit relations, which can be made explicit by synthesizing instructions via large multimodal models (LMMs) and large language models (LLMs).
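The paper's actual pipeline involves richer prompting, scoring, and filtering, but the core mining loop can be sketched as follows. This is a hedged illustration only: `caption_with_lmm` and `synthesize_instruction` are hypothetical stand-ins for the LMM and LLM calls mentioned above, not the paper's API.

```python
# Sketch of the triplet-mining idea: pair images that co-occur on the same
# web page, then verbalize their implicit relation as an open-ended
# instruction. Both helper functions are hypothetical placeholders.
from itertools import combinations


def caption_with_lmm(image) -> str:
    """Hypothetical: return an LMM-generated description of an image."""
    raise NotImplementedError


def synthesize_instruction(src_desc: str, tgt_desc: str) -> str:
    """Hypothetical: ask an LLM how to get from the source to the target."""
    raise NotImplementedError


def mine_triplets(pages):
    """Yield (query image, instruction, target image) triplets.

    Each page is assumed to be a dict with an "images" list. The real
    pipeline additionally filters low-quality pairs and deduplicates.
    """
    for page in pages:
        for img_a, img_b in combinations(page["images"], 2):
            instruction = synthesize_instruction(
                caption_with_lmm(img_a), caption_with_lmm(img_b)
            )
            yield (img_a, instruction, img_b)
```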
The paper first describes the data construction pipeline, which mines 36.7M (query image, instruction, target image) triplets with rich semantic relations from the web. It then introduces the MagicLens model, a lightweight dual-encoder architecture that jointly embeds an image together with a text instruction into a single query representation.
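As a rough illustration of how a dual encoder can fuse the two query modalities into one retrieval embedding, here is a minimal PyTorch sketch. The dimensions, pooling scheme, and module names are assumptions for illustration, not the paper's implementation, which builds on pretrained backbones such as CLIP or CoCa.

```python
# Minimal sketch of a joint (image, instruction) query encoder. Projected
# patch and token features are fused with self-attention and mean-pooled
# into one L2-normalized embedding for similarity-based retrieval.
import torch
import torch.nn as nn


class JointQueryEncoder(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8, num_layers: int = 2):
        super().__init__()
        # Stand-ins for pretrained vision/text towers (hypothetical sizes).
        self.vision_proj = nn.Linear(768, dim)  # image patch features -> dim
        self.text_proj = nn.Linear(768, dim)    # instruction tokens -> dim
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers)

    def forward(self, image_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # Project both modalities into a shared space, then fuse them.
        tokens = torch.cat(
            [self.vision_proj(image_feats), self.text_proj(text_feats)], dim=1
        )
        fused = self.fusion(tokens)
        # One embedding per (image, instruction) query; normalized so that
        # dot products against target-image embeddings act as cosine scores.
        return nn.functional.normalize(fused.mean(dim=1), dim=-1)


# Usage: a batch of 2 queries, 49 image patches and 16 instruction tokens each.
query = JointQueryEncoder()(torch.randn(2, 49, 768), torch.randn(2, 16, 768))
print(query.shape)  # torch.Size([2, 512])
```

Target images can be embedded by the same kind of encoder, so retrieval reduces to a nearest-neighbor search in the shared embedding space.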
Across multiple benchmarks, MagicLens outperforms previous state-of-the-art methods while using a model 50x smaller. A human evaluation on a 1.4M-scale retrieval pool further demonstrates that MagicLens captures and satisfies diverse search intents, especially complex ones that go beyond visual similarity.
Source: Kai Zhang et al., arXiv, 2024-03-29. https://arxiv.org/pdf/2403.19651.pdf