The author proposes an Image2Sentence-based Asymmetric zero-shot composed image retrieval (ISA) framework to address the challenges of model training and deployment in composed image retrieval tasks. By leveraging a lightweight model for queries and a large vision-language model for galleries, ISA improves retrieval accuracy and efficiency.
A training-free method for zero-shot composed image retrieval that uses local concept reranking to enhance performance.
A novel language-only training framework, LinCIR, that efficiently learns a projection module to enable zero-shot composed image retrieval without relying on expensive image-text-image triplet datasets.
A novel zero-shot composed image retrieval method that uses spherical linear interpolation to directly merge image and text representations, combined with a text-anchored fine-tuning strategy to enhance performance.
A novel zero-shot approach for Composed Image Retrieval (CIR) that maps reference images into pseudo-word tokens and combines them with relative captions to perform text-to-image retrieval.
A training-free approach for zero-shot composed image retrieval that leverages pretrained vision-language models and multimodal large language models to effectively fuse visual and textual information and incorporate textual descriptions of database images into the similarity computation.
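
The spherical linear interpolation mentioned above for merging image and text representations can be illustrated with a minimal sketch. This is not the implementation from any of the summarized papers: the function names `slerp` and `rank_gallery`, the mixing weight `alpha`, and the use of plain NumPy vectors in place of real vision-language embeddings are all assumptions made for illustration, and the text-anchored fine-tuning step is not covered.

```python
import numpy as np

def slerp(v_img, v_txt, alpha):
    """Spherical linear interpolation between two embeddings.

    alpha is a hypothetical mixing weight: 0 returns the normalized
    image embedding, 1 returns the normalized text embedding.
    Assumes the two vectors are not antiparallel.
    """
    a = v_img / np.linalg.norm(v_img)
    b = v_txt / np.linalg.norm(v_txt)
    # Angle between the two unit vectors, clipped for numerical safety.
    theta = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    if theta < 1e-6:
        # Nearly parallel vectors: fall back to normalized linear interpolation.
        out = (1.0 - alpha) * a + alpha * b
        return out / np.linalg.norm(out)
    return (np.sin((1.0 - alpha) * theta) * a + np.sin(alpha * theta) * b) / np.sin(theta)

def rank_gallery(query, gallery):
    """Return gallery indices sorted by descending cosine similarity to the query."""
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    return np.argsort(-(g @ q))

# Toy 2-D vectors standing in for image and text embeddings.
image_emb = np.array([1.0, 0.0])
text_emb = np.array([0.0, 1.0])
query = slerp(image_emb, text_emb, alpha=0.5)

gallery = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
ranking = rank_gallery(query, gallery)
```

Unlike a plain weighted average, the interpolated query stays on the unit sphere, so cosine similarity against the gallery needs no renormalization of the query. In this toy setup the third gallery vector, which lies midway between the two inputs, ranks first.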