Enhancing CLIP's Image Classification Performance by Leveraging Cross-Modal Neighbor Representation
Building on CLIP's strong cross-modal matching capability, we introduce a novel Cross-Modal Neighbor Representation (CODER) that extracts better image features, thereby improving CLIP's performance on downstream tasks such as zero-shot and few-shot image classification.
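The sentence above names the representation but does not spell out its construction. As a minimal sketch, assume a cross-modal neighbor representation describes each image by its cosine similarities to a set of text embeddings, so that the similarity vector itself serves as the image feature. This reading, the function names `neighbor_representation` and `zero_shot_predict`, and the per-class averaging rule are illustrative assumptions, and random vectors stand in for real CLIP image and text embeddings.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def neighbor_representation(image_emb, text_emb):
    """Hypothetical helper: describe each image by its similarity to every text.

    image_emb: (n_images, d) array, text_emb: (n_texts, d) array.
    Returns an (n_images, n_texts) array of cosine similarities.
    """
    return l2_normalize(image_emb) @ l2_normalize(text_emb).T


def zero_shot_predict(image_emb, text_emb, text_labels, n_classes):
    """Assumed decision rule: average the similarities per class, take the argmax."""
    sims = neighbor_representation(image_emb, text_emb)
    class_scores = np.zeros((sims.shape[0], n_classes))
    for c in range(n_classes):
        class_scores[:, c] = sims[:, text_labels == c].mean(axis=1)
    return class_scores.argmax(axis=1)


# Synthetic stand-ins for CLIP embeddings (assumption: not real model outputs).
rng = np.random.default_rng(0)
d, n_classes, texts_per_class = 16, 3, 4
text_labels = np.repeat(np.arange(n_classes), texts_per_class)
class_centers = rng.normal(size=(n_classes, d))
text_emb = class_centers[text_labels] + 0.1 * rng.normal(size=(text_labels.size, d))

image_labels = np.array([0, 1, 2, 1])
image_emb = class_centers[image_labels] + 0.1 * rng.normal(size=(image_labels.size, d))

coder = neighbor_representation(image_emb, text_emb)
pred = zero_shot_predict(image_emb, text_emb, text_labels, n_classes)
print(coder.shape, pred)
```

In a real pipeline the two embedding matrices would come from CLIP's image and text encoders; in a few-shot setting the similarity vectors could be fed to a lightweight classifier instead of the averaging rule used here.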