Hümmer, C., Schwonberg, M., Zhou, L., Cao, H., Knoll, A., & Gottschalk, H. (2024). Strong but simple: A Baseline for Domain Generalized Dense Perception by CLIP-based Transfer Learning. arXiv preprint arXiv:2312.02021v3.
This paper investigates the effectiveness of fine-tuning vision-language pre-trained models, specifically CLIP and EVA-CLIP, as a simple baseline for domain generalization in dense perception tasks, namely semantic segmentation and object detection. The authors aim to challenge the prevailing practice of relying on ImageNet pre-training or complex domain generalization techniques.
The authors fine-tune CLIP- and EVA-CLIP-pre-trained ViT encoders, paired with task-specific decoders (Mask2Former for semantic segmentation, ViTDet for object detection), on synthetic datasets (GTA5 for segmentation, UrbanSyn for detection). They then evaluate on several real-world datasets (Cityscapes, BDD100K, Mapillary, ACDC) in both synthetic-to-real and real-to-real domain generalization settings, comparing against state-of-the-art domain generalization methods, including approaches built on ImageNet pre-training, self-supervised learning, and other vision-language models.
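To make the recipe concrete, below is a minimal PyTorch sketch of this transfer-learning setup, assuming the open_clip package (`pip install open_clip_torch`). The per-patch linear head is a deliberately simplified stand-in for the much richer Mask2Former decoder used in the paper, and the random-tensor training snippet is purely illustrative; class names and hyperparameters here are assumptions, not taken from the paper.

```python
# Sketch of the transfer-learning recipe: take a CLIP ViT image encoder and
# fine-tune it end-to-end with a task decoder. The paper pairs the encoder
# with Mask2Former; a toy per-patch linear head stands in for it here.
import torch
import torch.nn as nn
import open_clip


class CLIPSegBaseline(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, num_classes: int = 19, model_name: str = "ViT-B-16"):
        super().__init__()
        clip_model, _, _ = open_clip.create_model_and_transforms(
            model_name, pretrained="openai"
        )
        self.encoder = clip_model.visual            # CLIP image tower (ViT)
        self.encoder.output_tokens = True           # also return patch tokens
        embed_dim = self.encoder.transformer.width  # per-token feature size
        # Toy decoder: classify each patch token, then upsample to pixels.
        self.head = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        _, tokens = self.encoder(images)            # (B, N, C) patch tokens
        b, n, c = tokens.shape
        s = int(n ** 0.5)                           # patch grid side length
        feat = tokens.transpose(1, 2).reshape(b, c, s, s)
        logits = self.head(feat)
        return nn.functional.interpolate(
            logits, size=images.shape[-2:], mode="bilinear",
            align_corners=False,
        )


if __name__ == "__main__":
    model = CLIPSegBaseline()
    x = torch.randn(2, 3, 224, 224)                 # synthetic-domain batch
    y = torch.randint(0, 19, (2, 224, 224))         # GTA5-style dense labels
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()                                 # full fine-tuning: grads
    print(loss.item())                              # flow into the CLIP ViT
```

Because the loss backpropagates through the encoder, this reproduces the paper's core idea of full fine-tuning rather than frozen-feature probing; only the decoder differs in complexity.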
The study demonstrates that the rich, diverse knowledge encoded in vision-language pre-trained models such as CLIP transfers effectively to downstream dense perception tasks, yielding strong domain generalization. On this basis, the authors advocate adopting vision-language pre-training as the new standard for domain generalization, moving beyond the limitations of ImageNet-based approaches.
This research significantly contributes to the field of domain generalization by presenting a simple yet highly effective baseline that leverages the power of vision-language pre-trained models. The findings have practical implications for deploying robust and generalizable computer vision systems in real-world scenarios.
The study primarily focuses on transformer-based architectures and a limited set of datasets. Future research could explore the effectiveness of this approach with other architectures and on a wider range of datasets and tasks. Additionally, investigating the impact of different fine-tuning strategies and the integration of complementary domain generalization techniques could further enhance performance.