Core Concepts
Vision Transformers demonstrate significant potential in addressing distribution shifts through diverse domain adaptation and domain generalization strategies, including feature-level, instance-level, model-level, and hybrid approaches.
Summary
The paper provides a comprehensive review of the research on adapting Vision Transformers (ViTs) to handle distribution shifts in computer vision tasks. It covers the fundamentals and architecture of ViTs, and then delves into the various strategies employed for Domain Adaptation (DA) and Domain Generalization (DG).
For DA, the paper categorizes the research into feature-level adaptation, instance-level adaptation, model-level adaptation, and hybrid approaches. Feature-level adaptation focuses on aligning feature distributions between source and target domains. Instance-level adaptation prioritizes specific data points that better reflect the target domain characteristics. Model-level adaptation involves developing specialized ViT architectures or layers to enhance adaptability. Hybrid approaches combine multiple adaptation techniques.
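To make the feature-level category concrete, the sketch below aligns source and target feature distributions with a maximum mean discrepancy (MMD) penalty on a shared ViT backbone. This is a minimal illustration of the general idea rather than any specific method surveyed in the paper; the `backbone`, `classifier`, and RBF-kernel MMD are assumed placeholders.

```python
import torch

def rbf_mmd(x, y, sigma=1.0):
    """Biased MMD^2 estimate between two feature batches, RBF kernel."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def feature_level_da_loss(backbone, classifier, src_imgs, src_labels, tgt_imgs, lam=0.1):
    f_src = backbone(src_imgs)  # (B, D) ViT [CLS] features, labeled source domain
    f_tgt = backbone(tgt_imgs)  # (B, D) ViT [CLS] features, unlabeled target domain
    task = torch.nn.functional.cross_entropy(classifier(f_src), src_labels)
    align = rbf_mmd(f_src, f_tgt)  # penalize distance between the two feature distributions
    return task + lam * align
```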
The paper also discusses diverse strategies used to enhance DA, such as adversarial learning, cross-domain knowledge transfer, visual prompts, self-supervised learning, hybrid networks, knowledge distillation, source-free adaptation, test-time adaptation, and pseudo-label refinement.
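Of these strategies, test-time adaptation is easy to illustrate in isolation. Below is a minimal TENT-style sketch (entropy minimization on unlabeled test batches, updating only a ViT's LayerNorm affine parameters); this is one common recipe from the test-time adaptation literature, not a method prescribed by the paper.

```python
import torch

def entropy(logits):
    p = logits.softmax(dim=-1)
    return -(p * p.log().clamp(min=-20)).sum(dim=-1).mean()

def configure_tta(vit):
    """Freeze the ViT except for LayerNorm affine parameters (TENT-style)."""
    vit.requires_grad_(False)
    params = []
    for m in vit.modules():
        if isinstance(m, torch.nn.LayerNorm):
            m.requires_grad_(True)
            params += [m.weight, m.bias]
    return torch.optim.SGD(params, lr=1e-3)

def tta_step(vit, optimizer, test_batch):
    """One adaptation step on an unlabeled test batch."""
    optimizer.zero_grad()
    entropy(vit(test_batch)).backward()  # confident predictions => lower entropy
    optimizer.step()
```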
For DG, the paper explores multi-domain learning, meta-learning, regularization techniques, and data augmentation strategies to enable ViTs to generalize well to unseen domains.
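As a small example of the data-augmentation family, the sketch below mixes images and labels drawn from two different source domains (cross-domain Mixup), encouraging representations that interpolate across domains; the function and parameter names are illustrative.

```python
import torch

def cross_domain_mixup(x_a, y_a, x_b, y_b, alpha=0.3):
    """Mix a batch from source domain A with a batch from source domain B.
    y_a and y_b are one-hot (or soft) label tensors."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    x = lam * x_a + (1.0 - lam) * x_b
    y = lam * y_a + (1.0 - lam) * y_b
    return x, y
```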
The comprehensive tables in the paper offer valuable insight into the approaches researchers have taken to address distribution shifts with ViTs. The findings highlight the versatility of ViTs in managing distribution shifts, which is crucial for real-world applications, especially in safety-critical and decision-making scenarios.
Statistics
"Deep learning models are often evaluated in scenarios where the data distribution is different from those used in the training and validation phases."
"Extensive use has led to detailed empirical [9] and analytical evaluations of convolutional networks [10, 11]."
"Recent advancements, however, have shown the potential of transformers regarding to their self-attention mechanisms that find the global features of the data which provide a more holistic view of the data, reduce inductive bias, and exhibit a high degree of scalability and flexibility."
"Researchers discovered that existing CNN architectures exhibit limited generalization capabilities when confronted with distribution shift scenarios [27, 28]."
Quotes
"Vision Transformers (ViT) [16], stands out as a key development in this area, applying a self-attention-based mechanism to sequences of image patches. It achieves competitive performance on the challenging ImageNet classification task [26], compared to CNNs."
"ViTs employ multi-head self-attention to intricately parse and interpret contextual information within images, thereby excelling in scenarios involving occlusions, domain variations, and perturbations. They demonstrate remarkable robustness, effectively maintaining accuracy despite image modifications."
"ViTs' ability to merge various features for image classification enhances their performance across diverse datasets, proving advantageous in both conventional and few-shot learning settings where the model is trained with only a few examples [33, 47, 48]."