Core Concepts
Retrieval-augmented adaptation can significantly improve the performance of pre-trained contrastive vision-language models on downstream tasks, especially in low-data regimes. Two insights stand out: (1) image-to-image (I2I) retrieval consistently outperforms text-to-image (T2I) retrieval, and (2) ensembling the zero-shot prediction with the prediction derived from retrieved samples is critical for effective adaptation.
Summary
This work systematically studies how retrieval-augmented adaptation affects the performance of pre-trained contrastive vision-language models, such as CLIP, in low-data scenarios.
The key findings are:
Retrieval method:
I2I retrieval, which uses a few seed images from the target dataset, consistently outperforms T2I retrieval, which uses textual class descriptions, across a wide range of downstream tasks and shot sizes.
The authors attribute this advantage to I2I retrieval returning samples that better match the target data distribution, whereas T2I retrieval can suffer from semantic ambiguity.
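The two retrieval modes differ only in what serves as the query. The following is a minimal sketch, assuming all embeddings come from the same CLIP model and are compared by cosine similarity in its shared image-text space. The helper names (`l2_normalize`, `retrieve_top_k`) and the toy vectors are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def retrieve_top_k(query_emb, pool_emb, k):
    """Return, for each query row, the indices of the k most similar pool rows."""
    q = l2_normalize(np.atleast_2d(np.asarray(query_emb, dtype=float)))
    p = l2_normalize(np.asarray(pool_emb, dtype=float))
    sims = q @ p.T  # cosine similarities, shape (n_queries, n_pool)
    return np.argsort(-sims, axis=1)[:, :k]

# Toy stand-ins for CLIP embeddings (hypothetical values, not real model outputs).
pool_image_emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
seed_image_emb = np.array([[1.0, 0.0]])  # I2I query: a seed image from the target dataset
class_text_emb = np.array([[0.6, 0.8]])  # T2I query: an embedded textual class description

i2i_idx = retrieve_top_k(seed_image_emb, pool_image_emb, k=2)
t2i_idx = retrieve_top_k(class_text_emb, pool_image_emb, k=2)
```

In this toy pool the two queries return different neighbors, which only illustrates that the query modality changes the retrieved set. It says nothing about which mode is better on real data.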
Logit ensemble:
Ensembling the zero-shot prediction of the pre-trained CLIP model with the logits computed from the retrieved samples is the key to improved adaptation performance.
Without ensembling, the performance of retrieval-augmented adaptation degrades significantly.
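The summary does not spell out the exact form of the ensemble, so the sketch below uses one common cache-style formulation (similar in spirit to Tip-Adapter) as an assumption: zero-shot logits from image-text similarity, plus a weighted term that aggregates the labels of retrieved samples by their affinity to the test image. The function names and the hyperparameters `scale`, `alpha`, and `beta` are hypothetical, and all embeddings are assumed to be L2-normalized.

```python
import numpy as np

def zero_shot_logits(img_emb, text_emb, scale=100.0):
    """Scaled cosine similarity between test images and class text embeddings."""
    return scale * img_emb @ text_emb.T  # shape (n_images, n_classes)

def retrieved_logits(img_emb, ret_emb, ret_labels, num_classes, beta=5.0):
    """Affinity-weighted vote over the labels of the retrieved samples."""
    affinity = np.exp(-beta * (1.0 - img_emb @ ret_emb.T))  # (n_images, n_retrieved)
    one_hot = np.eye(num_classes)[ret_labels]               # (n_retrieved, n_classes)
    return affinity @ one_hot

def ensemble_logits(img_emb, text_emb, ret_emb, ret_labels, alpha=1.0, beta=5.0):
    """Sum of zero-shot logits and alpha times the retrieval-based logits."""
    num_classes = text_emb.shape[0]
    zs = zero_shot_logits(img_emb, text_emb)
    rt = retrieved_logits(img_emb, ret_emb, ret_labels, num_classes, beta)
    return zs + alpha * rt

# Toy, already-normalized embeddings (hypothetical values).
img = np.array([[1.0, 0.0]])
text = np.array([[1.0, 0.0], [0.0, 1.0]])
ret = np.array([[0.0, 1.0]])
labels = np.array([1])

logits = ensemble_logits(img, text, ret, labels, alpha=1.0, beta=5.0)
prediction = logits.argmax(axis=1)
```

Setting `alpha=0.0` recovers the pure zero-shot prediction, while using only `retrieved_logits` corresponds to the no-ensemble setting that the study reports as significantly worse.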
The authors also provide a theoretical analysis to support these empirical observations. They characterize the modality gap and distribution shift induced by different retrieval methods, and prove the importance of logit ensembling for effective CLIP-based adaptation.
The work also explores alternative design choices, such as the impact of model architecture, the number of seed images, and adaptation with a mixture of retrieved and few-shot samples. The results demonstrate the consistent benefits of retrieval-augmented adaptation across different settings.
Statistics
"The zero-shot CLIP model achieves an average accuracy of 66.8% across the test datasets."
"I2I retrieval with 16 samples per class can improve the average accuracy to 73.9%, which is close to the oracle performance of 77.9% when directly retrieving from the target distribution."
Quotes
"I2I retrieval consistently outperforms T2I retrieval across all shots and datasets."
"Ensembling the zero-shot prediction together with I2I-retrieved samples is the key to improved adaptation performance."