The paper investigates how the difference in logit smoothness between teacher and student networks affects the knowledge distillation (KD) process. The authors observe that a fixed temperature shared by the teacher and student models often fails to address this mismatch, limiting the effectiveness of KD.
To tackle this issue, the authors propose Dynamic Temperature Knowledge Distillation (DTKD). The key idea is to use the logsumexp function to quantify the "sharpness" of the logit distribution, which reflects the smoothness of the output. By minimizing the sharpness gap between teacher and student, DTKD derives sample-specific temperatures for each of them, establishing a moderate common ground of softness.
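To make the mechanism concrete, below is a minimal PyTorch sketch of how such a scheme could look: logsumexp serves as the per-sample sharpness proxy, and a reference temperature is split between teacher and student in proportion to their sharpness so that the scaled sharpness values roughly match. The function name dtkd_loss, the proportional-split rule, the clamping guard, and the temperature-product scaling are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch of dynamic-temperature distillation (not the authors'
# reference code). Assumption: "sharpness" is measured by logsumexp of the
# raw logits, and the two sample-specific temperatures are chosen so that
# teacher and student sharpness roughly match while their sum stays at 2 * t_ref.
import torch
import torch.nn.functional as F

def dtkd_loss(student_logits, teacher_logits, t_ref=4.0, eps=1e-8):
    # Per-sample sharpness proxies (shape: [batch]).
    lse_s = torch.logsumexp(student_logits, dim=-1)
    lse_t = torch.logsumexp(teacher_logits, dim=-1)

    # Sample-specific temperatures: split 2 * t_ref in proportion to each
    # model's sharpness so that lse_t / temp_t ~= lse_s / temp_s.
    denom = lse_s + lse_t + eps
    temp_t = (2.0 * lse_t / denom * t_ref).clamp_min(1e-3).unsqueeze(-1)
    temp_s = (2.0 * lse_s / denom * t_ref).clamp_min(1e-3).unsqueeze(-1)
    # clamp_min is a numerical guard for rare negative logsumexp values,
    # added here for robustness (not taken from the source).

    # Standard KL distillation term, computed with the per-sample temperatures.
    log_p_s = F.log_softmax(student_logits / temp_s, dim=-1)
    p_t = F.softmax(teacher_logits / temp_t, dim=-1)
    kl = F.kl_div(log_p_s, p_t, reduction="none").sum(dim=-1)

    # Scale by the product of the two temperatures, mirroring the usual T^2
    # factor in vanilla KD (a modeling choice for this sketch).
    return (kl * temp_s.squeeze(-1) * temp_t.squeeze(-1)).mean()
```

In a training loop this loss would simply replace the fixed-temperature KD term, with t_ref playing the role that the single shared temperature plays in vanilla KD.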
Extensive experiments on CIFAR-100 and ImageNet show that DTKD performs comparably to or better than leading KD techniques, with added robustness in scenarios where only target-class or only non-target-class knowledge is distilled. DTKD is also efficient to train, adding negligible overhead compared to vanilla KD.
Source: Yukang Wei, Y..., arxiv.org, 04-22-2024, https://arxiv.org/pdf/2404.12711.pdf