Core Concepts
ScaleKD is a knowledge distillation method that transfers knowledge from large, pre-trained vision transformers (ViTs) to diverse student architectures, including CNNs, MLPs, and smaller ViTs. It achieves state-of-the-art results and can potentially eliminate the need for time-intensive pre-training of student models.
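For orientation, below is a minimal sketch of generic logit-based knowledge distillation in PyTorch. It illustrates only the basic teacher-to-student transfer idea, not ScaleKD's actual pipeline; the `distillation_loss` helper, temperature `T`, and weight `alpha` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft-target term: KL divergence between temperature-softened
    # teacher and student distributions (scaled by T^2, per convention).
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: standard cross-entropy on ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Usage sketch: the teacher stays frozen; only the student is updated.
# with torch.no_grad():
#     teacher_logits = teacher(images)
# loss = distillation_loss(student(images), teacher_logits, labels)
```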
Stats
ScaleKD achieves 75.15% top-1 accuracy for MobileNet-V1, 82.03% for ResNet-50, 84.16% for ConvNeXt-T, 78.63% for Mixer-S/16, 81.96% for Mixer-B/16, 83.93% for ViT-S/16, 83.80% for Swin-T, and 85.53% for ViT-B/16 models trained from scratch on ImageNet-1K, corresponding to absolute gains of 3.05%, 3.39%, 2.02%, 4.61%, 5.52%, 4.03%, 2.62%, and 3.73% over their individually trained counterparts, respectively.
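Since the gains above are stated as absolute differences, the baselines of the individually trained counterparts follow by simple subtraction, as this short sketch (using only the numbers reported above) shows:

```python
# (distilled top-1 accuracy, absolute gain) pairs, as reported above.
results = {
    "MobileNet-V1": (75.15, 3.05),
    "ResNet-50":    (82.03, 3.39),
    "ConvNeXt-T":   (84.16, 2.02),
    "Mixer-S/16":   (78.63, 4.61),
    "Mixer-B/16":   (81.96, 5.52),
    "ViT-S/16":     (83.93, 4.03),
    "Swin-T":       (83.80, 2.62),
    "ViT-B/16":     (85.53, 3.73),
}

for model, (distilled, gain) in results.items():
    baseline = distilled - gain  # implied individually trained accuracy
    print(f"{model}: {baseline:.2f}% -> {distilled:.2f}% (+{gain:.2f}%)")
# e.g. MobileNet-V1: 72.10% -> 75.15% (+3.05%)
```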
ScaleKD with Swin-L as the teacher outperforms individually trained ResNet-152, Mixer-B/16, and ViT-B/16 by margins of 0.28%, 2.19%, and 2.13%, respectively, while achieving over 2.35x, 3.23x, and 3.83x compression in model size.
ScaleKD achieves a mean top-1 accuracy improvement of 3.94% across 11 teacher-student pairs, with a maximum gain of 6.27%.
ScaleKD sees 5.58x, 11.75x, 195.39x, and 8.73x fewer training samples than counterpart methods based on supervised pre-training, self-supervised pre-training, cross-modal pre-training, and hybrid pre-training, respectively.
ResNet-50 and Swin-T pre-trained by ScaleKD outperform their baselines by average precision (AP) margins of 2.1% and 1.7% on object detection and 2.0% and 1.5% on instance segmentation on MS-COCO, respectively.
ViT-B/16 pre-trained by ScaleKD achieves a 4.09% absolute mean intersection over union (mIoU) gain for semantic segmentation on ADE20K, exceeding even its accuracy gain on ImageNet-1K.
ScaleKD outperforms recent top-performing KD methods such as DIST, DiffKD, and OFA by clear margins (0.70% on ResNet-50 and 1.30% on Swin-T) despite using a less performant teacher and fewer training epochs.
ScaleKD even surpasses FunMatch by 0.24% in top-1 accuracy while using less than 10% of its training epochs.