The authors explore the synergy between sketches and text for fine-grained image retrieval, introducing a novel compositionality framework built on pre-trained CLIP models.
Combining sketches and text for precise image retrieval.
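The core idea of composing a sketch query with a text query can be illustrated in embedding space. The sketch below is a generic, hypothetical illustration (not the paper's actual method): it assumes sketch, text, and gallery images have already been embedded into a shared space (e.g. by CLIP encoders), composes the two query modalities with a simple convex combination, and ranks gallery images by cosine similarity. The `alpha` weight and the combination rule are assumptions for illustration only.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Scale vectors to unit length so dot products equal cosine similarity
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def compose_query(sketch_emb, text_emb, alpha=0.5):
    # Hypothetical composition: convex combination of the two modalities,
    # re-normalized to stay on the unit sphere
    q = alpha * l2_normalize(sketch_emb) + (1 - alpha) * l2_normalize(text_emb)
    return l2_normalize(q)

def retrieve(query, gallery):
    # gallery: (N, d) image embeddings; rank by cosine similarity, best first
    sims = l2_normalize(gallery) @ query
    return np.argsort(-sims)

# Toy demo with hand-made 3-D embeddings standing in for CLIP features
sketch = np.array([1.0, 0.0, 0.0])   # encodes shape cues from the sketch
text = np.array([0.0, 1.0, 0.0])     # encodes appearance cues from the text
gallery = np.array([
    [0.9, 0.9, 0.1],   # matches both cues -> should rank first
    [1.0, 0.0, 0.0],   # matches the sketch only
    [0.0, 0.0, 1.0],   # matches neither
])
ranking = retrieve(compose_query(sketch, text), gallery)
print(ranking.tolist())  # -> [0, 1, 2]
```

The toy gallery shows why composition helps: the image matching both cues outranks the one matching the sketch alone, which neither modality could guarantee by itself.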
The paper presents a set of practical guidelines for designing high-performance fine-grained image retrieval models. It also proposes a novel Dual Visual Filtering mechanism and a Discriminative Model Training strategy that capture subcategory-specific discrepancies, enhancing the model's discriminability and generalization ability.