Core Concepts
This paper proposes Block-wise Logit Distillation (Block-KD), a knowledge distillation framework that bridges the gap between logit-based and feature-based distillation methods by implicitly aligning features through a series of intermediate "stepping-stone" models, achieving comparable or superior results to state-of-the-art methods.
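A minimal sketch of how such a stepping-stone model could be assembled, assuming the student's first k blocks feed through a small projector into the teacher's remaining blocks (the class, argument names, and projector handling here are illustrative, not the authors' code):

```python
import torch.nn as nn

class SteppingStone(nn.Module):
    """Hybrid model: the student's first k blocks followed by the
    teacher's remaining blocks. Hypothetical sketch of the idea only;
    the teacher's final block is assumed to include its classifier head."""
    def __init__(self, student_blocks, teacher_blocks, projector, k):
        super().__init__()
        self.student_front = nn.Sequential(*student_blocks[:k])
        self.teacher_back = nn.Sequential(*teacher_blocks[k:])
        self.projector = projector  # maps student feature shape to the teacher's

    def forward(self, x):
        feat = self.student_front(x)       # student features at block k
        feat = self.projector(feat)        # project into the teacher's feature space
        return self.teacher_back(feat)     # logits of the hybrid stepping-stone
```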
Statistics
Stepping-stone models are built by gradually replacing the teacher's shallow blocks with the student's; these intermediate hybrids are used only during training and are discarded at inference.
The stepping-stone objectives $\{\mathcal{L}_{N_i}\}_{i \le n}$ are divided by the coefficient $2^{n-i}$, where $n$ is the total number of blocks.
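A hedged sketch of this weighting, assuming a standard temperature-scaled KL distillation loss for each stepping stone (the function name, temperature value, and list ordering are illustrative assumptions, not the paper's code):

```python
import torch.nn.functional as F

def blockwise_kd_loss(stepping_stone_logits, teacher_logits, T=4.0):
    """Sum per-stepping-stone KD losses, scaling L_{N_i} by 1 / 2**(n - i).
    `stepping_stone_logits` is assumed to be ordered as [N_1, ..., N_n].
    Illustrative sketch only."""
    n = len(stepping_stone_logits)
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    total = 0.0
    for i, logits in enumerate(stepping_stone_logits, start=1):
        kd = F.kl_div(F.log_softmax(logits / T, dim=1),
                      soft_targets, reduction="batchmean") * (T * T)
        total = total + kd / (2 ** (n - i))  # deeper stepping stones weighted more
    return total
```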
On CIFAR-100, models are trained using SGD for 240 epochs with a batch size of 64.
On ImageNet, models are trained for 100 epochs with a batch size of 512.
For NLP tasks, training involves 10 epochs of fine-tuning with a batch size of 32.
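For reference, the stated setups can be collected into a small configuration sketch; fields not reported in the summary (e.g., learning rate, weight decay, optimizer for ImageNet and NLP) are deliberately omitted:

```python
# Training setups as reported above; only the stated fields are included.
TRAIN_CONFIGS = {
    "cifar100":     {"optimizer": "SGD", "epochs": 240, "batch_size": 64},
    "imagenet":     {"epochs": 100, "batch_size": 512},
    "nlp_finetune": {"epochs": 10, "batch_size": 32},
}
```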
Quotations
"This paper provides a unified perspective of feature alignment in order to obtain a better comprehension of their fundamental distinction."
"Inheriting the design philosophy and insights of feature-based and logit-based methods, we introduce a block-wise logit distillation framework to apply implicit logit-based feature alignment by gradually replacing teacher’s blocks as intermediate stepping-stone models to bridge the gap between the student and the teacher."
"Our method obtains comparable or superior results to state-of-the-art distillation methods."