Hümmer, C., Schwonberg, M., Zhou, L., Cao, H., Knoll, A., & Gottschalk, H. (2024). Strong but simple: A Baseline for Domain Generalized Dense Perception by CLIP-based Transfer Learning. arXiv preprint arXiv:2312.02021v3.
This paper investigates the effectiveness of fine-tuning vision-language pre-trained models, specifically CLIP and EVA-CLIP, as a simple baseline for domain generalization in dense perception tasks, namely semantic segmentation and object detection. The authors aim to challenge the prevailing practice of relying on ImageNet pre-training or complex domain generalization techniques.
The authors fine-tune CLIP and EVA-CLIP pre-trained ViT encoders with task-specific decoders (Mask2Former for segmentation and ViTDet for object detection) on synthetic datasets (GTA5 for segmentation and UrbanSyn for object detection). They evaluate the fine-tuned models on several real-world datasets (Cityscapes, BDD100k, Mapillary, ACDC) in both synthetic-to-real and real-to-real domain generalization settings. Performance is compared against state-of-the-art domain generalization methods, including those that use ImageNet pre-training, self-supervised learning, and other vision-language models.
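The evaluation protocol described above (train on one source domain, then score the same model on several unseen target domains) can be illustrated with a minimal sketch. It assumes mean intersection-over-union (mIoU) as the segmentation metric, which this summary does not name but which is the usual choice for this task. The function names, the `ignore_index` value of 255, and the toy labels are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Sequence, Tuple


def confusion_matrix(pred: Sequence[int], target: Sequence[int],
                     num_classes: int, ignore_index: int = 255) -> List[List[int]]:
    """Count (target, prediction) pairs over flattened per-pixel labels."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_index:  # skip unlabeled pixels (assumed convention)
            continue
        cm[t][p] += 1
    return cm


def mean_iou(cm: List[List[int]]) -> float:
    """Average IoU over classes that occur in the prediction or the target."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp
        fp = sum(cm[r][c] for r in range(n)) - tp
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


def evaluate_domains(results: Dict[str, Tuple[Sequence[int], Sequence[int]]],
                     num_classes: int) -> Dict[str, float]:
    """Score one model on several target domains and add the unweighted average."""
    scores = {
        name: mean_iou(confusion_matrix(pred, target, num_classes))
        for name, (pred, target) in results.items()
    }
    scores["average"] = sum(scores.values()) / len(scores)
    return scores


# Toy example: two hypothetical target domains with flattened label maps.
toy_results = {
    "domain_a": ([0, 0, 1, 1], [0, 1, 1, 255]),  # one error, one ignored pixel
    "domain_b": ([0, 1, 1, 0], [0, 1, 1, 0]),    # perfect prediction
}
toy_scores = evaluate_domains(toy_results, num_classes=2)
```

Reporting the unweighted average across target domains keeps a large dataset from dominating the score; whether the paper aggregates results this way is an assumption of the sketch.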
The study demonstrates that the rich and diverse knowledge encoded in vision-language pre-trained models like CLIP can be effectively transferred to downstream dense perception tasks for achieving strong domain generalization. The authors advocate for adopting vision-language pre-training as a new standard for domain generalization, moving beyond the limitations of ImageNet-based approaches.
This research contributes a simple yet effective baseline for domain generalization, built on vision-language pre-trained models. The findings have practical implications for deploying computer vision systems that must stay robust under unseen real-world conditions.
The study primarily focuses on transformer-based architectures and a limited set of datasets. Future research could explore the effectiveness of this approach with other architectures and on a wider range of datasets and tasks. Additionally, investigating the impact of different fine-tuning strategies and the integration of complementary domain generalization techniques could further enhance performance.