Key Concepts
The author explores a method to enhance group robustness in pre-trained vision-language models without group annotations, calibrating representations through contrastive learning. The central claim is that spurious correlations can be mitigated, and generalization improved, without ever collecting group labels.
Summary
The content discusses the challenges of fine-tuning pre-trained vision-language models such as CLIP, focusing on reducing reliance on spurious features without group annotations. It introduces a representation calibration method, Contrastive Feature Recalibration (CFR), to improve group robustness. Extensive experiments validate the approach's effectiveness in improving generalization and reducing bias.
The study reveals issues with spurious correlations in pre-trained CLIP models and proposes a two-step approach involving creating a calibration set and recalibrating representations through contrastive learning. The method aims to improve feature quality, enhance robustness, and optimize group accuracy without relying on group-specific information.
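The two-step recipe above (build a calibration set, then recalibrate representations contrastively) is not spelled out in detail in this summary. As an illustration only, a generic supervised contrastive loss of the kind such recalibration typically optimizes can be sketched as follows; the loss form, temperature, and toy data are assumptions, not the paper's exact CFR objective:

```python
import numpy as np

def supervised_contrastive_loss(features, labels, temperature=0.1):
    """Generic supervised contrastive loss over L2-normalized features.

    Illustrative stand-in only: pulls same-class features together and
    pushes different-class features apart; not the exact CFR objective.
    """
    # L2-normalize so dot products are cosine similarities
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T / temperature  # pairwise similarity logits
    n = len(labels)
    loss, count = 0.0, 0
    for i in range(n):
        # positives: other samples sharing the anchor's label
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not pos:
            continue
        logits = np.delete(sim[i], i)            # exclude self-similarity
        log_denom = np.log(np.exp(logits).sum()) # softmax normalizer
        for j in pos:
            jj = j if j < i else j - 1           # index shift after delete
            loss += -(logits[jj] - log_denom)    # -log p(positive | anchor)
            count += 1
    return loss / count

# toy calibration set: 4 feature vectors, 2 classes
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
labels = np.array([0, 0, 1, 1])
print(supervised_contrastive_loss(feats, labels))
```

In practice this loss would be minimized over (a projection of) the frozen model's features on the calibration set, so that representations cluster by class rather than by spurious attribute.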
Key points include identifying spurious correlations in CLIP, verifying the efficacy of feature extractors, introducing CFR for representation calibration, conducting experiments across benchmarks, comparing different sample selection strategies, and analyzing the impact of holistic data integration on performance.
Statistics
Fine-tuning with Empirical Risk Minimization (ERM) risks amplifying spurious correlations.
Last-layer retraining can greatly improve group robustness on pretrained CLIP.
Contrastive Feature Recalibration (CFR) significantly improves group robustness.
DPS+RNS outperforms other semi-supervised baselines across all benchmarks.
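Last-layer retraining, mentioned in the statistics above, means freezing the pretrained backbone and refitting only the final linear classifier on its features. A minimal sketch with a toy logistic-regression head; the data and hyperparameters here are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def retrain_last_layer(feats, labels, lr=0.5, steps=200):
    """Retrain only a linear head (logistic regression) on frozen features.

    The backbone is represented by the fixed `feats` matrix and is never
    updated; only the head weights w and bias b are learned.
    """
    n, d = feats.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))  # sigmoid predictions
        grad = p - labels                            # dLoss/dLogits
        w -= lr * feats.T @ grad / n                 # gradient step on head only
        b -= lr * grad.mean()
    return w, b

# toy frozen "backbone features": classes separated along the first dimension
rng = np.random.default_rng(1)
feats = rng.normal(size=(100, 5))
labels = (feats[:, 0] > 0).astype(float)
w, b = retrain_last_layer(feats, labels)
preds = (feats @ w + b > 0).astype(float)
print((preds == labels).mean())  # accuracy on the toy data
```

The appeal of this scheme is cost: only a d-dimensional linear head is optimized, yet, as the statistic above notes, it can considerably improve group robustness of a pretrained CLIP.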
Quotes
"Retraining the last layer can considerably improve the group robustness of a pre-trained CLIP."
"Our method demonstrates its capability to achieve state-of-the-art results in robust classification."