Key Concepts
A novel few-shot feature distillation approach for vision transformers based on intermittent weight copying and enhanced low-rank adaptation, which enables efficient training of lightweight transformer models on limited data.
Summary
The paper proposes a novel few-shot feature distillation approach for vision transformers, called WeCoLoRA, which consists of two key steps:
- Leveraging the consistent depth-wise structure of vision transformers, the authors first copy the weights from intermittent layers of existing pre-trained vision transformers (teachers) into shallower architectures (students), where the intermittence factor controls the complexity of the student transformer with respect to its teacher.
- The authors then employ an enhanced version of Low-Rank Adaptation (LoRA) to distill knowledge into the student in a few-shot scenario, aiming to recover the information processing carried out by the skipped teacher layers.
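The two steps above can be sketched in code. This is a minimal illustrative sketch, not the authors' implementation: the function names and toy "layer" representation are assumptions, and the second function shows the standard LoRA low-rank update (W + B·A) rather than the enhanced variant the paper proposes.

```python
def copy_intermittent_layers(teacher_layers, k):
    """Step 1: build a shallower student by copying every k-th teacher
    layer; the intermittence factor k controls student depth."""
    return [teacher_layers[i] for i in range(0, len(teacher_layers), k)]

def lora_update(weight, A, B, alpha=1.0):
    """Step 2 (standard LoRA core): a frozen weight W (d_out x d_in) is
    adapted as W + alpha * B @ A, where A (r x d_in) and B (d_out x r)
    are the only trainable matrices. Plain nested lists stand in for
    tensors here."""
    d_out, d_in = len(weight), len(weight[0])
    r = len(A)
    return [
        [weight[i][j] + alpha * sum(B[i][t] * A[t][j] for t in range(r))
         for j in range(d_in)]
        for i in range(d_out)
    ]

# A 12-block teacher with intermittence factor k=3 yields a 4-block student.
teacher = [f"block_{i}" for i in range(12)]
print(copy_intermittent_layers(teacher, 3))
# ['block_0', 'block_3', 'block_6', 'block_9']

# Rank-1 LoRA update on a 2x2 identity weight.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]           # r=1, d_in=2
B = [[0.5], [0.25]]        # d_out=2, r=1
print(lora_update(W, A, B))
# [[1.5, 0.5], [0.25, 1.25]]
```

Because every copied block keeps the teacher's input/output dimensions, the student remains a valid transformer, and the low-rank matrices are then trained on the few available samples to compensate for the skipped layers.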
The authors present comprehensive experiments with supervised and self-supervised transformers as teachers, on five data sets from various domains, including natural, medical and satellite images. The results confirm the superiority of the proposed WeCoLoRA approach over competitive baselines. The ablation study demonstrates the usefulness of each component of the proposed pipeline.
The authors also analyze the features learned by the distilled models, showing that WeCoLoRA generates more robust and discriminative features compared to the strongest competitor.
Statistics
"The accuracy tends to grow as the model gets larger [19,49], most of the attention has been dedicated to building larger and more powerful models."
"Large amounts of data are not always available in some domains, e.g. hyperspectral image segmentation [30]."
"Our approach is divided into two steps. For the first step, we leverage the fact that vision transformers have a consistent depth-wise structure, i.e. the input and output dimensions are compatible across transformer blocks."
"We perform the pre-training stage of efficient student transformers via few-shot knowledge distillation on various subsets of ImageNet-1K [15]."
Quotes
"Few-shot knowledge distillation (FSKD) paradigm [24,39,51,53,55,58,59, 76, 79], which was explored in both language [53, 58, 79] and vision [24, 39,41, 42, 51, 59, 76] domains."
"We emphasize that FSKD has not been extensively explored in the vision domain [24, 39, 41, 42, 51, 55, 59, 76], with even less studies focused on the pre-training stage of vision transformers [42]."
"We present few-shot and linear probing experiments on five benchmark data sets comprising natural, medical and satellite images, demonstrating the utility of our training pipeline across different domains."