DeiT-LT: Efficient Training of Vision Transformers on Long-Tailed Datasets


Key Concepts
DeiT-LT introduces an efficient distillation scheme to train Vision Transformers from scratch on long-tailed datasets. It leverages out-of-distribution distillation and low-rank feature learning to create specialized experts for majority and minority classes within a single ViT architecture.
Summary

The paper introduces DeiT-LT, a training scheme for efficiently training Vision Transformers (ViTs) from scratch on long-tailed datasets. The key components of DeiT-LT are:

  1. Out-of-Distribution (OOD) Distillation: DeiT-LT distills knowledge from a CNN teacher using strongly augmented OOD images. This induces local, generalizable features in the early ViT blocks, improving performance on minority (tail) classes (see the training-step sketch after this list).

  2. Tail Expert with DRW Loss: The distillation token (DIST) in DeiT-LT is trained using Deferred Re-Weighting (DRW) loss, which enhances the focus on learning from tail classes. This leads to the DIST token becoming an expert on tail classes, while the classification token (CLS) becomes an expert on head classes.

  3. Low-Rank Feature Learning: DeiT-LT distills knowledge from CNN teachers trained with Sharpness Aware Minimization (SAM), which induces low-rank, generalizable features across the ViT blocks and further improves performance on minority classes (a SAM sketch also follows the list).
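
To make components 1 and 2 concrete, the following is a minimal PyTorch-style sketch of the per-step loss, not the authors' released implementation. It assumes a `student` that returns logits from both the CLS and DIST heads and a frozen `cnn_teacher`; the effective-number weighting is one common way to realize DRW, and all names and hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def drw_class_weights(class_counts, beta=0.9999):
    # Effective-number class weights, a common choice for deferred re-weighting:
    # rare classes receive larger weights than frequent ones.
    counts = torch.as_tensor(class_counts, dtype=torch.float)
    effective_num = 1.0 - torch.pow(beta, counts)
    weights = (1.0 - beta) / effective_num
    return weights / weights.sum() * len(class_counts)

def deit_lt_step_loss(student, cnn_teacher, images_strong, targets,
                      epoch, drw_start_epoch, class_counts):
    """Loss for one step: the CLS head follows ground-truth labels,
    the DIST head follows the CNN teacher on strongly augmented (OOD) views."""
    # Assumed interface: the student returns logits from both tokens.
    cls_logits, dist_logits = student(images_strong)

    # CLS token: plain cross-entropy with ground truth (head-class expert).
    loss_cls = F.cross_entropy(cls_logits, targets)

    # Teacher predicts hard labels on the strongly augmented images, which are
    # out-of-distribution for a CNN trained on clean, low-resolution inputs.
    with torch.no_grad():
        teacher_labels = cnn_teacher(images_strong).argmax(dim=1)

    # DIST token: deferred re-weighting kicks in after a warm-up phase,
    # up-weighting rare classes so the DIST token specializes on the tail.
    weights = None
    if epoch >= drw_start_epoch:
        weights = drw_class_weights(class_counts).to(dist_logits.device)
    loss_dist = F.cross_entropy(dist_logits, teacher_labels, weight=weights)

    # Equal mixing of the two objectives, as in DeiT-style hard distillation.
    return 0.5 * loss_cls + 0.5 * loss_dist
```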

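Component 3 assumes a CNN teacher trained with SAM. Below is a minimal sketch of the generic two-step SAM update, an illustration rather than the exact teacher-training recipe used in the paper; `rho` controls the size of the neighborhood in which the loss is flattened.

```python
import torch
import torch.nn.functional as F

def sam_step(teacher, optimizer, images, targets, rho=0.05):
    """One Sharpness-Aware Minimization update for the CNN teacher (sketch)."""
    params = [p for p in teacher.parameters() if p.requires_grad]
    optimizer.zero_grad()

    # First forward/backward: gradient at the current weights.
    F.cross_entropy(teacher(images), targets).backward()

    with torch.no_grad():
        grads = [p.grad if p.grad is not None else torch.zeros_like(p) for p in params]
        grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]))
        # Ascend to the worst-case nearby weights: w <- w + rho * g / ||g||.
        eps = [rho * g / (grad_norm + 1e-12) for g in grads]
        for p, e in zip(params, eps):
            p.add_(e)

    # Second forward/backward: the gradient at the perturbed weights drives the update.
    optimizer.zero_grad()
    F.cross_entropy(teacher(images), targets).backward()

    with torch.no_grad():
        # Restore the original weights before the optimizer step.
        for p, e in zip(params, eps):
            p.sub_(e)
    optimizer.step()
    optimizer.zero_grad()
```

The perturb-then-update structure is what biases the teacher toward flatter minima, which the paper credits for the low-rank, generalizable features distilled into the ViT student.
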
The authors demonstrate the effectiveness of DeiT-LT across diverse small-scale (CIFAR-10 LT, CIFAR-100 LT) and large-scale (ImageNet-LT, iNaturalist-2018) long-tailed datasets. DeiT-LT outperforms state-of-the-art CNN-based methods and other ViT baselines, achieving superior performance on both majority and minority classes without requiring any pre-training.

Statistics
In the long-tailed datasets used, the most frequent class contains up to 100 times as many samples as the least frequent class.
DeiT-LT (with a PaCo+SAM teacher) achieves 87.5% overall accuracy on CIFAR-10 LT (ρ=100) and 55.6% on CIFAR-100 LT (ρ=100), outperforming the teacher by 1.9% and 4.5%, respectively.
On ImageNet-LT, DeiT-LT (with a PaCo+SAM teacher) reaches 59.1% overall accuracy, a 1.6% improvement over the teacher.
On iNaturalist-2018, it reaches 75.1% overall accuracy, a 1.7% improvement over the teacher.
Quotes
"DeiT-LT involves distilling knowledge from low-resolution teacher networks using out-of-distribution (OOD) images generated through strong augmentations." "To improve the generality of features, we propose to distill knowledge via flat CNN teachers trained through Sharpness Aware Minimization (SAM)." "In DeiT-LT, we ensure the divergence of the classification and distillation tokens such that the classification token becomes an expert on the majority classes, whereas the distillation token learns local low-rank features, becoming an expert on the minority."

Deeper Questions

How can the DeiT-LT framework be extended to other transformer-based architectures beyond ViT to improve their performance on long-tailed datasets?

The DeiT-LT framework can be extended to other transformer-based architectures beyond ViT by adapting the distillation and training strategies to suit the specific architecture. For instance, for architectures like BERT or GPT, the distillation process can involve transferring knowledge from pre-trained language models to improve their performance on long-tailed datasets. The key lies in identifying the appropriate teacher models and designing the distillation process to effectively transfer knowledge while addressing the challenges posed by long-tailed data distributions. Additionally, incorporating techniques like deferred re-weighting and inducing low-rank features from flat teachers can be beneficial for enhancing the generalizability of features in other transformer architectures.

What are the potential limitations of the DeiT-LT approach, and how can they be addressed in future work?

One potential limitation of the DeiT-LT approach is the heavy reliance on distillation for learning from tail classes, which may lead to saturation in learning from the CNN teacher and hinder further improvements in performance. To address this limitation, future work could explore adaptive methods that dynamically shift the focus from distillation to ground truth labels as the training progresses. This adaptive approach could help prevent saturation and allow the model to continuously learn from both the teacher and ground truth labels, leading to better performance on long-tailed datasets. Additionally, exploring novel techniques to balance the learning from both majority and minority classes within the transformer backbone could further enhance the effectiveness of the DeiT-LT approach.
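
As a purely illustrative example of such an adaptive shift (not part of DeiT-LT), the weight on the distillation term could be annealed toward the ground-truth term over training; the schedule and its endpoints below are hypothetical.

```python
def distill_weight(epoch, total_epochs, start=0.9, end=0.3):
    """Hypothetical schedule: linearly shift emphasis from the CNN teacher's
    labels toward the ground-truth labels as training progresses."""
    t = min(max(epoch / max(total_epochs - 1, 1), 0.0), 1.0)
    return start + t * (end - start)

# Example mixing at a given epoch:
# w = distill_weight(epoch, total_epochs)
# loss = w * loss_dist + (1 - w) * loss_cls
```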

Can the insights from DeiT-LT, such as the creation of specialized experts within a single transformer backbone, be applied to other domains beyond computer vision to tackle long-tailed data distributions?

The insights from DeiT-LT, such as the creation of specialized experts within a single transformer backbone, can be applied to other domains beyond computer vision to tackle long-tailed data distributions. For example, in natural language processing tasks, this approach could be utilized to train transformer models on imbalanced text datasets. By creating specialized experts within the transformer architecture that focus on different classes or categories of text data, the model can effectively learn from both majority and minority classes, improving performance on long-tailed text datasets. This approach can be particularly useful in sentiment analysis, document classification, and other text-based tasks where imbalanced data distributions are common.