The paper addresses the misalignment between CLIP's image feature extraction method and its pre-training paradigm. It presents a new perspective based on nearest neighbors to understand CLIP's strong zero-shot image classification capability. The key insight is that CLIP's effective text-image matching capability embeds image information in image-text distances. This leads to the proposal of the Cross-Modal Neighbor Representation (CODER), which utilizes these image-text distances for improved image representation.
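The construction can be sketched as follows: the image representation becomes the vector of the image's similarities to a pool of neighbor texts. This is a minimal illustration with random placeholder embeddings standing in for CLIP's encoders; the `encode_image`/`encode_text` names and the prompt pool are assumptions for the sketch, not the paper's API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder unit embeddings standing in for CLIP's encoders (d = 8).
def encode_image(_img, d=8):
    v = rng.normal(size=d)
    return v / np.linalg.norm(v)

def encode_text(_txt, d=8):
    v = rng.normal(size=d)
    return v / np.linalg.norm(v)

# A pool of neighbor texts; in the paper these come from the Auto Text Generator.
neighbor_texts = [f"a photo of concept {i}" for i in range(32)]
T = np.stack([encode_text(t) for t in neighbor_texts])  # shape (32, 8)

def coder(image):
    """Cross-Modal Neighbor Representation: the vector of image-text
    cosine similarities replaces the raw CLIP image embedding."""
    v = encode_image(image)
    return T @ v  # shape (32,): similarity to each neighbor text

feat = coder("example.jpg")
print(feat.shape)  # (32,)
```

Because every embedding is unit-normalized, each entry of `feat` is a cosine similarity in [-1, 1], so the representation's quality depends directly on how densely the neighbor texts cover the space.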
To construct a high-quality CODER, the authors introduce the Auto Text Generator (ATG), which automatically produces diverse, high-quality texts to ensure dense sampling of neighbor texts. Experiments show that CODER consistently boosts CLIP's zero-shot and few-shot image classification performance across various datasets and model architectures. A second-stage rerank based on one-to-one specific CODER features further enhances CLIP's performance, demonstrating the effectiveness of the proposed two-stage zero-shot classification method.
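One plausible reading of the two-stage pipeline is: a coarse first stage keeps the top-k candidate classes, then a rerank stage rescores only those candidates against class-specific text sets. The sketch below uses random placeholder embeddings in place of CLIP outputs and ATG texts; the five-texts-per-class pool is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_classes, k = 8, 10, 3

# Placeholder unit embeddings standing in for CLIP outputs.
def unit(n):
    v = rng.normal(size=(n, d))
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

class_text_emb = unit(n_classes)  # one prompt embedding per class
image_emb = unit(1)[0]

# Stage 1: coarse zero-shot scores; keep the top-k candidate classes.
scores = class_text_emb @ image_emb
candidates = np.argsort(scores)[::-1][:k]

# Stage 2 (rerank): rescore each candidate with its own specific text set
# (one-to-one specific CODER texts in the paper; here, 5 random
# placeholder texts per candidate class).
specific_texts = {c: unit(5) for c in candidates}
rerank = {c: float((specific_texts[c] @ image_emb).mean()) for c in candidates}
prediction = max(rerank, key=rerank.get)
print(prediction in candidates)  # True
```

Restricting the expensive class-specific scoring to the top-k survivors keeps the rerank stage cheap while letting the specific texts discriminate between closely ranked classes.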
The paper also highlights the importance of dense neighbor text sampling in improving CODER quality. Experiments validate that as the diversity and quantity of high-quality cross-modal neighbor texts increase, the constructed CODER correspondingly improves.
Key insights distilled from the paper by Chao Yi, Lu R... at arxiv.org, 04-30-2024
https://arxiv.org/pdf/2404.17753.pdf