Decoupling a classifier's logit outputs at different scales and distilling each scale separately, rather than matching only the global logits, enhances knowledge transfer and improves student performance.
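Assuming "different scales" means class logits computed from features pooled over global and local regions, with each set distilled separately, a minimal PyTorch sketch of such a multi-scale logit loss might look as follows; the function names (`kd_kl`, `multi_scale_kd_loss`) and the simple averaging over scales are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def kd_kl(student_logits, teacher_logits, tau=4.0):
    """Standard KD term: KL between temperature-softened distributions."""
    p_t = F.softmax(teacher_logits / tau, dim=1)
    log_p_s = F.log_softmax(student_logits / tau, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * tau * tau

def multi_scale_kd_loss(student_logit_list, teacher_logit_list, tau=4.0):
    """Distill logits at every scale instead of only the global ones.

    Entry i of each list holds logits of shape (B, C) computed from features
    pooled at scale i (e.g. scale 0 = global average pool, others = local regions).
    """
    assert len(student_logit_list) == len(teacher_logit_list)
    losses = [kd_kl(s, t, tau) for s, t in zip(student_logit_list, teacher_logit_list)]
    return sum(losses) / len(losses)

# Toy usage: global logits plus logits from a 2x2 grid of local regions.
B, C = 8, 100
student_scales = [torch.randn(B, C) for _ in range(5)]
teacher_scales = [torch.randn(B, C) for _ in range(5)]
loss = multi_scale_kd_loss(student_scales, teacher_scales)
```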
The authors propose Contrastive Abductive Knowledge Extraction (CAKE) as a model-agnostic method to mimic deep classifiers without access to original data, paving the way for broad application.
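The summary above does not spell out CAKE's procedure, so the sketch below only illustrates the general data-free setting it targets: synthetic inputs are optimized to elicit confident class predictions from a frozen teacher, and the student is then trained to match the teacher on those inputs. All names (`synthesize_batch`, `distill_step`) and design choices are hypothetical stand-ins, not the CAKE algorithm itself.

```python
import torch
import torch.nn.functional as F

def synthesize_batch(teacher, num_classes, shape=(8, 3, 32, 32), steps=50, lr=0.1):
    """Optimize random noise so the frozen teacher assigns it confidently to
    randomly chosen target classes -- a generic data-free trick, not CAKE."""
    x = torch.randn(shape, requires_grad=True)
    targets = torch.randint(num_classes, (shape[0],))
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(teacher(x), targets).backward()
        opt.step()
    return x.detach()

def distill_step(student, teacher, x, opt, tau=4.0):
    """Match the student's softened predictions to the teacher's on synthetic data."""
    with torch.no_grad():
        p_t = F.softmax(teacher(x) / tau, dim=1)
    log_p_s = F.log_softmax(student(x) / tau, dim=1)
    loss = F.kl_div(log_p_s, p_t, reduction="batchmean") * tau * tau
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage with placeholder linear models standing in for real deep classifiers.
teacher = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10)).eval()
for p in teacher.parameters():
    p.requires_grad_(False)
student = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
opt = torch.optim.SGD(student.parameters(), lr=0.01)
x_syn = synthesize_batch(teacher, num_classes=10)
distill_step(student, teacher, x_syn, opt)
```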
The authors explore a variant of knowledge distillation that drops temperature scaling on the student side, termed Transformed Teacher Matching (TTM), and show that it improves model generalization. They further introduce Weighted TTM (WTTM), which adds sample-adaptive weighting to the TTM objective, as an even more effective distillation approach.
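A minimal sketch of a TTM-style objective is given below, assuming the only change from standard KD is that temperature softening is applied to the teacher while the student distribution is left unscaled. The per-sample weight shown for the WTTM variant is a hypothetical stand-in based on teacher confidence, since the paper's exact weighting coefficient is not given here.

```python
import torch
import torch.nn.functional as F

def ttm_loss(student_logits, teacher_logits, tau=4.0):
    """TTM-style objective: temperature softening on the teacher only;
    the student's distribution is left at temperature 1."""
    p_t = F.softmax(teacher_logits / tau, dim=1)      # softened teacher
    log_q_s = F.log_softmax(student_logits, dim=1)    # unscaled student
    return F.kl_div(log_q_s, p_t, reduction="batchmean")

def wttm_loss(student_logits, teacher_logits, tau=4.0):
    """Weighted variant: scale each sample's TTM term by a coefficient derived
    from the softened teacher distribution (an illustrative weighting, not the
    paper's exact formula)."""
    p_t = F.softmax(teacher_logits / tau, dim=1)
    log_q_s = F.log_softmax(student_logits, dim=1)
    per_sample = F.kl_div(log_q_s, p_t, reduction="none").sum(dim=1)
    # Hypothetical weight: emphasize samples where the softened teacher is less peaked.
    weight = 1.0 - p_t.max(dim=1).values
    return (weight * per_sample).mean()

# Toy usage with random logits.
s, t = torch.randn(8, 100), torch.randn(8, 100)
print(ttm_loss(s, t).item(), wttm_loss(s, t).item())
```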