Core Concepts
Anchor-based Robust Finetuning (ARF) regularizes the finetuning process of vision-language models like CLIP to preserve their out-of-distribution generalization capabilities in both domain shift and zero-shot learning scenarios.
Abstract
The paper proposes an Anchor-based Robust Finetuning (ARF) approach to finetune vision-language models like CLIP while preserving their out-of-distribution (OOD) generalization capabilities in both domain shift and zero-shot learning scenarios.
The key insights are:
Conventional finetuning methods that only use class labels as supervision can lead to a significant degradation of the model's OOD generalization.
ARF incorporates two types of anchors to regularize the finetuning process:
The Text-Compensated Anchor Generation (TCAG) module uses a pretrained captioner to generate semantically rich text descriptions of the training images as anchors.
The Image-Text Anchor Retrieval (ITAR) module retrieves relevant image-text pairs from a dataset similar to CLIP's pretraining data as additional anchors.
These two types of anchors with abundant semantic information help preserve the original feature space of CLIP, thereby maintaining its OOD generalization.
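The anchor-regularized objective described above can be sketched as a standard CLIP-style classification loss plus an alignment term that keeps finetuned features close to frozen anchor features. This is an illustrative reconstruction, not the paper's exact loss: the function name `arf_style_loss`, the cosine-distance alignment term, and the `anchor_weight` and `temperature` values are assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # project features onto the unit sphere, as CLIP does before similarity
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_entropy(logits, labels):
    # numerically stable softmax cross-entropy over class logits
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def arf_style_loss(img_feats, class_text_feats, labels,
                   anchor_feats, anchor_weight=0.5, temperature=0.01):
    """Illustrative anchor-regularized finetuning loss (not the authors' code).

    img_feats:        finetuned image features, shape (B, D)
    class_text_feats: class-name text features, shape (C, D)
    labels:           ground-truth class indices, shape (B,)
    anchor_feats:     frozen anchor features (e.g. TCAG captions or
                      ITAR-retrieved pairs, encoded by frozen CLIP), shape (B, D)
    """
    img = l2_normalize(img_feats)
    txt = l2_normalize(class_text_feats)
    logits = img @ txt.T / temperature            # CLIP-style class logits
    cls_loss = cross_entropy(logits, labels)      # semantic-scarce label supervision
    # anchor alignment: pull finetuned features toward the frozen anchors,
    # preserving the original semantically rich feature space
    anchors = l2_normalize(anchor_feats)
    align_loss = (1.0 - (img * anchors).sum(axis=1)).mean()
    return cls_loss + anchor_weight * align_loss
```

With `anchor_weight=0`, this reduces to conventional label-only finetuning; the alignment term is what regularizes the feature space toward the pretrained model.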
Extensive experiments demonstrate that ARF achieves state-of-the-art performance on domain shift and zero-shot learning benchmarks while matching the in-distribution accuracy of conventional finetuning methods.
Stats
The in-distribution datasets used for finetuning are ImageNet and DomainNet-Real.
The domain shift evaluation datasets include ImageNet-V2, ImageNet-Sketch, ImageNet-A, ImageNet-R, ObjectNet, and DomainNet (Clipart, Infograph, Painting, Sketch).
The zero-shot learning evaluation datasets include Caltech101, Flowers102, Food101, SUN397, DTD, FGVCAircraft, StanfordCars, OxfordPets, EuroSAT, and UCF101.
Quotes
"Anchor-based Robust Finetuning (ARF) regularizes the finetuning process of vision-language models like CLIP to preserve their out-of-distribution generalization capabilities in both domain shift and zero-shot learning scenarios."
"The decline in OOD generalization stems from the semantic-scarce supervision containing only class labels during the finetuning process."
"These two types of anchors with abundant semantic information help preserve the original feature space of CLIP, thereby maintaining its OOD generalization."