This paper proposes a novel knowledge distillation framework called Block-wise Logit Distillation (Block-KD) that bridges the gap between logit-based and feature-based distillation methods, achieving superior performance by implicitly aligning features through a series of intermediate "stepping-stone" models.
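A minimal sketch of the stepping-stone idea, assuming the teacher and student are split into the same number of blocks and that feature shapes are made compatible upstream (e.g., by projectors folded into the student blocks); `kd_loss`, `student_blocks`, `teacher_blocks`, and the two heads are illustrative names, not the paper's API.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Standard temperature-scaled KL divergence on logits."""
    return F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * T * T

def blockwise_logit_distillation(x, student_blocks, teacher_blocks,
                                 student_head, teacher_head, T=4.0):
    """Stepping-stone sketch: the student's first k blocks feed the teacher's
    remaining blocks, and every hybrid model is supervised with logit KD."""
    with torch.no_grad():
        t_feat = x
        for blk in teacher_blocks:
            t_feat = blk(t_feat)
        teacher_logits = teacher_head(t_feat)

    losses = []
    s_feat = x
    for k, s_blk in enumerate(student_blocks):
        s_feat = s_blk(s_feat)
        # Hybrid "stepping-stone": student blocks 0..k, then teacher blocks k+1..end
        # (assumes feature shapes are made compatible, e.g. by built-in projectors).
        h_feat = s_feat
        for t_blk in teacher_blocks[k + 1:]:
            h_feat = t_blk(h_feat)
        losses.append(kd_loss(teacher_head(h_feat), teacher_logits, T))

    # Plain logit KD on the full student as well.
    losses.append(kd_loss(student_head(s_feat), teacher_logits, T))
    return sum(losses) / len(losses)
```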
This paper presents a multi-level feature distillation (MLFD) technique that combines the knowledge of multiple teacher models trained on different datasets and transfers it to a single student model, achieving performance gains over models trained on a single dataset.
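A minimal sketch of the multi-teacher feature-matching idea, assuming one linear projector per teacher and a simple weighted L2 objective; the module name, projector design, and weighting are assumptions, and a multi-level variant would apply this at several network depths.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTeacherFeatureDistiller(nn.Module):
    """Project the student's features into each teacher's feature space and
    match them with a weighted L2 loss (one projector per teacher)."""
    def __init__(self, student_dim, teacher_dims):
        super().__init__()
        self.projectors = nn.ModuleList(nn.Linear(student_dim, d) for d in teacher_dims)

    def forward(self, student_feat, teacher_feats, weights=None):
        weights = weights or [1.0] * len(teacher_feats)
        loss = torch.zeros((), device=student_feat.device)
        for proj, t_feat, w in zip(self.projectors, teacher_feats, weights):
            loss = loss + w * F.mse_loss(proj(student_feat), t_feat.detach())
        return loss
```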
In high-dimensional linear regression, strategically crafting a "weak teacher" model for knowledge distillation can outperform training with true labels, but it cannot fundamentally change the data scaling law.
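A toy illustration of the setting (not the paper's weak-teacher construction or its proof): a heavily shrunken ridge estimate plays the role of the weak teacher, and a student fit on its predictions is compared against a student fit on the noisy true labels; whether distillation wins depends on the noise level and regularization, which is what the analysis characterizes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 400, 0.5                    # more dimensions than samples
beta = rng.normal(size=d) / np.sqrt(d)         # true parameter
X = rng.normal(size=(n, d))
y = X @ beta + sigma * rng.normal(size=n)      # noisy labels

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

beta_teacher = ridge(X, y, lam=50.0)           # "weak teacher": heavily shrunken fit

beta_true = ridge(X, y, lam=1.0)               # student trained on true labels
beta_distilled = ridge(X, X @ beta_teacher, lam=1.0)  # student trained on teacher labels

X_test = rng.normal(size=(2000, d))
for name, b in [("true labels", beta_true), ("distilled", beta_distilled)]:
    print(f"{name:12s} excess risk: {np.mean((X_test @ (b - beta)) ** 2):.4f}")
```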
SIKeD, a novel iterative knowledge distillation technique, enhances the mathematical reasoning abilities of smaller language models by addressing the limitations of traditional distillation methods and enabling the models to effectively learn and select from multiple reasoning strategies.
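How SIKeD schedules its iterations is best taken from the paper; the toy sketch below only shows the generic shape such an objective can take, mixing a loss on teacher-generated rationales with a loss on the student's own correct generations and annealing the mixing weight across iterations (all names and the annealing schedule are assumptions).

```python
import torch
import torch.nn.functional as F

def mixed_distillation_loss(teacher_data_logits, teacher_data_labels,
                            self_data_logits, self_data_labels, alpha):
    """Convex mix of the loss on teacher-generated data and the loss on the
    student's own correct generations (alpha annealed across iterations)."""
    loss_teacher = F.cross_entropy(teacher_data_logits, teacher_data_labels)
    loss_self = F.cross_entropy(self_data_logits, self_data_labels)
    return alpha * loss_teacher + (1.0 - alpha) * loss_self

# Toy usage: each iteration, the student re-generates solutions with several
# strategies, only correct ones are kept, and alpha is decayed so on-policy
# data gradually dominates.
for alpha in (1.0, 0.7, 0.5):
    logits_t, labels_t = torch.randn(8, 100), torch.randint(0, 100, (8,))
    logits_s, labels_s = torch.randn(8, 100), torch.randint(0, 100, (8,))
    loss = mixed_distillation_loss(logits_t, labels_t, logits_s, labels_s, alpha)
    # ...backprop `loss` through the student here...
```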
The CAKD framework enhances knowledge distillation in neural networks by decoupling the Kullback-Leibler (KL) divergence loss function, allowing for targeted emphasis on critical elements and improving knowledge transfer efficiency from teacher to student models.
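CAKD's exact decoupling is defined in the paper; the sketch below shows one common way a KL distillation loss is decoupled (a DKD-style split into a target-class term and a non-target-class term, each with its own weight), which illustrates how targeted emphasis on specific parts of the teacher's distribution can be implemented.

```python
import torch
import torch.nn.functional as F

def decoupled_kd_loss(student_logits, teacher_logits, targets,
                      alpha=1.0, beta=2.0, T=4.0):
    """Split the KL loss into a target-class (binary) term and a non-target-class
    term, so each part of the teacher's distribution can be weighted separately."""
    p_s = F.softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)
    mask = F.one_hot(targets, num_classes=p_s.size(1)).bool()

    # Binary (target vs. rest) distributions.
    bin_s = torch.stack([p_s[mask], 1 - p_s[mask]], dim=1)
    bin_t = torch.stack([p_t[mask], 1 - p_t[mask]], dim=1)
    target_term = F.kl_div(bin_s.log(), bin_t, reduction="batchmean")

    # Distribution over non-target classes only (target logit masked out).
    non_s = F.log_softmax(student_logits / T - 1000.0 * mask.float(), dim=1)
    non_t = F.softmax(teacher_logits / T - 1000.0 * mask.float(), dim=1)
    non_target_term = F.kl_div(non_s, non_t, reduction="batchmean")

    return (alpha * target_term + beta * non_target_term) * T * T
```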
This paper introduces PCKD, a novel knowledge distillation method for convolutional neural networks that improves student network performance by transferring knowledge from teacher networks using a category contrastive learning approach and a preview-based learning strategy to handle samples of varying difficulty.
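A rough sketch of the two ingredients named above, under loose assumptions: category contrast is approximated by an InfoNCE-style pull toward per-class centers (e.g., derived from teacher features), and "preview" is approximated by down-weighting samples the teacher itself finds hard; the actual PCKD formulation and schedule are the paper's.

```python
import torch
import torch.nn.functional as F

def category_contrastive_loss(student_feat, targets, class_centers, tau=0.1):
    """Pull each student feature toward its own class center and push it away
    from the other centers (InfoNCE over classes)."""
    s = F.normalize(student_feat, dim=1)
    c = F.normalize(class_centers, dim=1)          # (num_classes, feat_dim)
    return F.cross_entropy(s @ c.t() / tau, targets)

def preview_weights(teacher_logits, targets, gamma=2.0):
    """Down-weight samples the teacher finds hard, so easy samples dominate early
    ('preview'); a real schedule would phase harder samples in over time."""
    with torch.no_grad():
        p_correct = F.softmax(teacher_logits, dim=1).gather(1, targets[:, None]).squeeze(1)
    return p_correct.pow(gamma)

# Typical use: rescale per-sample CE/KD losses, e.g.
#   loss = (preview_weights(t_logits, y) * per_sample_loss).mean()
```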
This paper introduces TAS, a novel knowledge distillation method that uses a hybrid assistant model to bridge the gap between teacher and student networks with different architectures, enabling efficient knowledge transfer in cross-architecture knowledge distillation (CAKD).
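A loose sketch of what a hybrid assistant can look like when the teacher is a CNN and the student is a transformer, under the assumption that the assistant reuses the teacher's convolutional stem and stacks student-style token blocks on top; TAS's actual assistant design and training recipe are the paper's.

```python
import torch
import torch.nn as nn

class HybridAssistant(nn.Module):
    """Assistant that mixes the two families: the teacher's convolutional stages
    produce a feature map, which is tokenized and processed by student-style
    transformer blocks before classification."""
    def __init__(self, teacher_stem, student_style_blocks, embed_dim, num_classes):
        super().__init__()
        self.stem = teacher_stem                  # early CNN stages from the teacher
        self.blocks = student_style_blocks        # transformer blocks shaped like the student
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        f = self.stem(x)                          # (B, C, H, W)
        tokens = f.flatten(2).transpose(1, 2)     # (B, H*W, C) token sequence
        tokens = self.blocks(tokens)
        return self.head(tokens.mean(dim=1))      # mean-pooled classification logits
```

Distillation can then run teacher to assistant and assistant to student (or jointly), so each hop crosses only one kind of architectural gap.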
Dual augmentation in knowledge distillation, where different augmentations are applied to teacher and student models, improves the transfer of invariant representations, leading to more robust and generalizable student models, especially in same-architecture settings.
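A minimal sketch of dual augmentation, assuming image classification with torchvision-style transforms: the teacher and student receive different views of the same batch, and the usual temperature-scaled KL loss is applied across views; the specific augmentations chosen here are illustrative.

```python
import torch
import torch.nn.functional as F
from torchvision import transforms

# Two deliberately different augmentation pipelines (choices are illustrative);
# `image` is assumed to be a float tensor batch in [0, 1].
teacher_aug = transforms.Compose([transforms.RandomResizedCrop(32),
                                  transforms.RandomHorizontalFlip()])
student_aug = transforms.Compose([transforms.RandomResizedCrop(32),
                                  transforms.ColorJitter(0.4, 0.4, 0.4)])

def dual_augmentation_kd_loss(image, teacher, student, T=4.0):
    """Teacher and student see different views of the same images, so matching
    their outputs rewards augmentation-invariant representations."""
    with torch.no_grad():
        t_logits = teacher(teacher_aug(image))
    s_logits = student(student_aug(image))
    return F.kl_div(F.log_softmax(s_logits / T, dim=1),
                    F.softmax(t_logits / T, dim=1),
                    reduction="batchmean") * T * T
```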
Knowledge distillation (KD) methods based on Kullback-Leibler (KL) divergence often struggle to effectively transfer knowledge from larger, more accurate teacher models to smaller student models due to capacity mismatch and the implicit alteration of inter-class relationships. This paper introduces Correlation Matching Knowledge Distillation (CMKD), a novel approach that leverages both Pearson and Spearman correlation coefficients to address these limitations and achieve more efficient and robust distillation from stronger teacher models.
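A sketch of correlation-based logit matching as described above: a row-wise Pearson term plus a Spearman term computed as Pearson on ranks (plain argsort ranks are not differentiable, so a practical implementation would need a soft ranking); how CMKD combines or gates the two terms is the paper's design.

```python
import torch

def pearson_corr_loss(student_logits, teacher_logits, eps=1e-8):
    """1 - row-wise Pearson correlation between student and teacher logits."""
    s = student_logits - student_logits.mean(dim=1, keepdim=True)
    t = teacher_logits - teacher_logits.mean(dim=1, keepdim=True)
    corr = (s * t).sum(dim=1) / (s.norm(dim=1) * t.norm(dim=1) + eps)
    return (1.0 - corr).mean()

def spearman_corr_loss(student_logits, teacher_logits):
    """Spearman = Pearson on ranks; hard argsort ranks are non-differentiable,
    so this version is only illustrative."""
    s_rank = student_logits.argsort(dim=1).argsort(dim=1).float()
    t_rank = teacher_logits.argsort(dim=1).argsort(dim=1).float()
    return pearson_corr_loss(s_rank, t_rank)

def correlation_matching_loss(student_logits, teacher_logits, lam=0.5):
    """Illustrative combination of the two correlation terms."""
    return lam * pearson_corr_loss(student_logits, teacher_logits) \
        + (1.0 - lam) * spearman_corr_loss(student_logits, teacher_logits)
```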
Progressive distillation, a technique where a student model learns from intermediate checkpoints of a teacher model, accelerates training by implicitly providing a curriculum of easier-to-learn subtasks, as demonstrated through theoretical analysis and empirical results on sparse parity, probabilistic context-free grammars (PCFGs), and real-world language modeling tasks.
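A minimal sketch of progressive distillation, assuming a list of intermediate teacher checkpoints ordered from early to final and a standard temperature-scaled KL objective; the optimizer and hyperparameters are placeholders.

```python
import torch
import torch.nn.functional as F

def progressive_distillation(student, teacher_checkpoints, loader,
                             epochs_per_ckpt=1, T=2.0, lr=1e-3):
    """Distill the student against a sequence of intermediate teacher checkpoints
    (ordered early -> final) instead of only the fully trained teacher."""
    opt = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
    for teacher in teacher_checkpoints:
        teacher.eval()
        for _ in range(epochs_per_ckpt):
            for x, _ in loader:                   # labels unused: pure logit KD
                with torch.no_grad():
                    t_logits = teacher(x)
                s_logits = student(x)
                loss = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                                F.softmax(t_logits / T, dim=1),
                                reduction="batchmean") * T * T
                opt.zero_grad()
                loss.backward()
                opt.step()
    return student
```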