The authors propose a retrieval-based pipeline for automatic dataset construction and introduce TransAgg, a transformer-based adaptive aggregation model, showing competitive performance in zero-shot composed image retrieval.
The paper initiates a scalable pipeline for automatic dataset construction and proposes the TransAgg model for zero-shot composed image retrieval.
Scaling the number of positive and negative examples in contrastive learning can effectively improve the performance of composed image retrieval models.
DQU-CIR is a simple yet effective framework that performs raw-data-level multimodal fusion to fully leverage the multimodal encoding and cross-modal alignment capabilities of vision-language pre-trained models for composed image retrieval.
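
The claim above about scaling the number of positive and negative examples in contrastive learning can be made concrete with a small sketch. The code below is a generic InfoNCE-style loss for a single query, written in pure Python; it is not taken from any of the summarized papers. The function name `multi_positive_info_nce`, the default `temperature`, and the averaging over positives are all assumptions chosen for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so that dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def multi_positive_info_nce(query, positives, negatives, temperature=0.07):
    """Illustrative InfoNCE-style loss for one query (hypothetical helper).

    query:     composed-query embedding (list of floats)
    positives: list of target embeddings that match the query
    negatives: list of target embeddings that do not match
    Returns the mean of -log softmax(positive) over all positives.
    """
    q = normalize(query)
    pos_logits = [dot(q, normalize(p)) / temperature for p in positives]
    neg_logits = [dot(q, normalize(n)) / temperature for n in negatives]
    all_logits = pos_logits + neg_logits

    # Log-sum-exp with max subtraction for numerical stability.
    m = max(all_logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in all_logits))

    # Each positive is scored against every candidate (all positives and negatives).
    return sum(log_denominator - l for l in pos_logits) / len(pos_logits)


if __name__ == "__main__":
    query = [1.0, 0.2]
    positives = [[0.9, 0.1], [1.0, 0.3]]
    few_negatives = [[0.0, 1.0]]
    many_negatives = [[0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
    print(multi_positive_info_nce(query, positives, few_negatives))
    print(multi_positive_info_nce(query, positives, many_negatives))
```

In this sketch, adding negatives enlarges the softmax denominator, so each query is contrasted against more candidates per update. Whether that translates into better retrieval performance is the empirical claim made in the summarized work, not something this snippet demonstrates.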