The paper addresses the misalignment between how CLIP extracts image features and how it was pre-trained. It offers a nearest-neighbor perspective on CLIP's strong zero-shot image classification: because CLIP matches text and images effectively, the distances between an image and a set of texts encode rich information about that image. This insight motivates the Cross-Modal Neighbor Representation (CODER), which uses these image-text distances as an improved image representation.
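As a rough illustration of the idea, a CODER-style feature can be built by measuring an image's similarity to a bank of neighbor texts in CLIP's embedding space. The sketch below assumes OpenAI's `clip` package; the function names (`encode_texts`, `coder`) are illustrative, not from the paper's released code.

```python
# Minimal sketch of a CODER-style feature, assuming OpenAI's `clip` package.
# Function names are illustrative, not the authors' API.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def encode_texts(texts):
    """L2-normalized CLIP text embeddings for a list of strings."""
    tokens = clip.tokenize(texts).to(device)
    feats = model.encode_text(tokens)
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def coder(image, neighbor_texts):
    """Represent an image by its cosine similarities to neighbor texts."""
    img = preprocess(image).unsqueeze(0).to(device)
    img_feat = model.encode_image(img)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    text_feats = encode_texts(neighbor_texts)
    # Each entry of the returned vector is one image-text similarity,
    # so the vector as a whole encodes the image via its text neighbors.
    return (img_feat @ text_feats.T).squeeze(0)
```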
To construct a high-quality CODER, the authors introduce the Auto Text Generator (ATG), which automatically produces diverse, high-quality texts so that neighbor texts are densely sampled. Experiments show that CODER consistently boosts CLIP's zero-shot and few-shot image classification performance across datasets and model architectures. A rerank stage based on a one-to-one specific CODER further improves CLIP's accuracy, demonstrating the effectiveness of the proposed two-stage zero-shot classification method; a sketch of the first stage follows.
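One plausible reading of the summary is that, in the first stage, both the image and each class prompt are mapped into the same neighbor-text similarity space and matched there. The sketch below builds on the `coder` helper above; it is an interpretation under that assumption, not the authors' exact pipeline, and the rerank stage is omitted.

```python
@torch.no_grad()
def coder_zero_shot(image, class_prompts, neighbor_texts):
    """First-stage zero-shot prediction: match image and class CODERs."""
    img_coder = coder(image, neighbor_texts)        # (T,) image CODER
    neigh_feats = encode_texts(neighbor_texts)      # (T, D)
    class_feats = encode_texts(class_prompts)       # (C, D)
    class_coders = class_feats @ neigh_feats.T      # (C, T): one CODER per class
    # Cosine similarity between the image CODER and every class CODER.
    img_n = img_coder / img_coder.norm()
    cls_n = class_coders / class_coders.norm(dim=-1, keepdim=True)
    return (cls_n @ img_n).argmax().item()          # index of predicted class
```

In this reading, ATG's role is to supply the `neighbor_texts` list, and denser sampling of that list yields a more informative similarity vector.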
The paper also highlights the importance of dense neighbor-text sampling for CODER quality: experiments confirm that as the diversity and quantity of high-quality cross-modal neighbor texts grow, the resulting CODER improves correspondingly.