This study focuses on composed image retrieval (CIR), where a reference image and a relative caption together specify the target image to retrieve. It proposes a scalable dataset construction method and an adaptive aggregation model, and extensive experiments demonstrate that the approach achieves state-of-the-art results in zero-shot scenarios.
Recent vision-language models have made progress in jointly learning image and text representations. Building on this, the study aims to improve image retrieval by combining a reference image with a relative caption that describes the desired modification. Existing approaches, however, rely on manually constructed triplet datasets for CIR training, which are costly and time-consuming to produce.
The authors introduce a scalable pipeline to automatically construct datasets for CIR training using large-scale image-caption data available online. They propose TransAgg, a transformer-based model that dynamically fuses information from different modalities. Results show that the proposed approach outperforms existing state-of-the-art models in zero-shot composed image retrieval benchmarks.
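To make the adaptive aggregation idea concrete, below is a minimal PyTorch sketch of a TransAgg-style fusion module. The class name `TransAggSketch`, the embedding dimension, the use of frozen CLIP-style encoders for the input tokens, and the learned per-token weighting head are illustrative assumptions, not the authors' exact implementation; the sketch only shows the general pattern of fusing image and text tokens with a transformer and aggregating them into a single composed query embedding.

```python
# Hypothetical sketch of a TransAgg-style adaptive aggregation module.
# Names, dimensions, and the weighting scheme are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TransAggSketch(nn.Module):
    def __init__(self, dim=512, num_layers=2, num_heads=8):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        # Transformer that jointly attends over image and text tokens.
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Adaptive aggregation: learn a scalar weight per fused token.
        self.weight_head = nn.Linear(dim, 1)

    def forward(self, image_tokens, text_tokens):
        # image_tokens: (B, Ni, dim) from a frozen vision encoder (e.g. CLIP)
        # text_tokens:  (B, Nt, dim) from a frozen text encoder
        tokens = torch.cat([image_tokens, text_tokens], dim=1)
        fused = self.fusion(tokens)                       # (B, Ni+Nt, dim)
        weights = self.weight_head(fused).softmax(dim=1)  # (B, Ni+Nt, 1)
        query = (weights * fused).sum(dim=1)              # (B, dim) composed query
        return F.normalize(query, dim=-1)


# Usage: the composed query embedding is compared against candidate image
# embeddings (e.g. by cosine similarity) to rank retrieval results.
img = torch.randn(4, 50, 512)
txt = torch.randn(4, 16, 512)
model = TransAggSketch()
q = model(img, txt)  # (4, 512)
```

In this reading, retrieval reduces to nearest-neighbor search between the composed query embedding and precomputed embeddings of the candidate images, which is what makes the zero-shot setting possible once the fusion module is trained on automatically constructed triplets.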
Source: Yikun Liu et al., https://arxiv.org/pdf/2306.07272.pdf