Core Concepts
The paper proposes Mapping-Emulation Knowledge Distillation (MEKD), a method that distills a cumbersome black-box teacher model into a lightweight student without exposing the teacher's internal structure or parameters.
Abstract
The paper addresses the problem of Black-Box Knowledge Distillation (B2KD), where the internal structure and parameters of the teacher model hosted on a cloud server are invisible and unavailable to the edge device. The authors propose a two-step workflow consisting of deprivatization and distillation.
Deprivatization:
- The authors train a GAN using random noise as input to synthesize privacy-free images that can maximize the responses of the teacher model.
- The synthetic images are sent to the cloud server to obtain soft or hard inference responses, which are then used in the distillation step (a query sketch follows this list).
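A minimal sketch of this query step in PyTorch-style Python. The generator architecture, the noise dimension (taken here to match the class dimension C, consistent with the quoted proposition below), and `query_teacher` (a hypothetical stand-in for the cloud inference API) are assumptions for illustration, not the paper's interface.

```python
import torch

@torch.no_grad()
def build_transfer_set(generator, query_teacher, num_batches=100,
                       batch_size=64, noise_dim=10, soft=True):
    """Collect (synthetic image, teacher response) pairs for distillation.

    generator     : trained deprivatization generator mapping noise to images.
    query_teacher : hypothetical stand-in for the black-box cloud API; returns
                    class probabilities for a batch of images.
    soft          : keep the full probability vectors (soft responses) or only
                    the argmax labels (hard responses).
    """
    images, responses = [], []
    for _ in range(num_batches):
        z = torch.randn(batch_size, noise_dim)    # random noise input
        x = generator(z)                          # privacy-free synthetic images
        p = query_teacher(x)                      # black-box inference on the cloud
        images.append(x.cpu())
        responses.append(p.cpu() if soft else p.argmax(dim=1).cpu())
    return torch.cat(images), torch.cat(responses)
```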
Distillation:
- The well-trained generator is frozen and grafted behind the teacher and student models, using the softened logits of both models as the generator input.
- The distance between the high-dimensional image points generated from the logits of the teacher and student models is minimized to drive the student model to mimic the output logits of the teacher model.
- The authors also use the Kullback-Leibler Divergence (KLD) between the softened logits of the teacher and student models as an additional loss term (a loss sketch follows this list).
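A minimal PyTorch-style sketch of this distillation objective. The choice of L2 as the image-space distance, the use of temperature-softened probabilities as the generator input, and the names `frozen_generator`, `tau`, and `lam` are assumptions for illustration rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mekd_distillation_loss(student_logits, teacher_logits, frozen_generator,
                           tau=4.0, lam=1.0):
    """Sketch of the MEKD distillation objective.

    student_logits   : logits z_S from the on-device student (requires grad).
    teacher_logits   : responses z_T collected from the cloud teacher, treated
                       as constants (no gradient flows to the teacher).
    frozen_generator : the deprivatization generator f_G, frozen, mapping
                       softened outputs in R^C to image points in R^n.
    """
    # Temperature-softened class distributions for both models.
    p_student = F.softmax(student_logits / tau, dim=1)
    p_teacher = F.softmax(teacher_logits.detach() / tau, dim=1)

    # Map both softened outputs to high-dimensional image points with the
    # frozen generator and align them (L2 distance is an assumed choice).
    x_student = frozen_generator(p_student)
    with torch.no_grad():
        x_teacher = frozen_generator(p_teacher)
    emulation_loss = F.mse_loss(x_student, x_teacher)

    # Additional KD term: KL divergence between the softened outputs
    # (scaled by tau^2, the usual temperature convention).
    kld_loss = F.kl_div(F.log_softmax(student_logits / tau, dim=1),
                        p_teacher, reduction="batchmean") * tau * tau

    return emulation_loss + lam * kld_loss
```

In this sketch, the teacher responses for the synthetic transfer set are the ones already collected during deprivatization, so no additional cloud queries are needed during distillation.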
The authors provide a theoretical analysis showing that reducing the distance between the high-dimensional image points drives alignment of the low-dimensional logits, and that this optimization direction differs from direct logits alignment. Experimental results on various benchmarks demonstrate the effectiveness of the proposed MEKD method, which outperforms previous state-of-the-art approaches.
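A sketch of why the two directions can differ, with σ_τ denoting the temperature-softened softmax and d a generic distance; the chain-rule step below is standard, not a restatement of the paper's proof.

```latex
% Direct logits alignment (low-dimensional, in R^C):
\mathcal{L}_{\mathrm{KD}} = \mathrm{KL}\!\big(\sigma_\tau(z_T)\,\|\,\sigma_\tau(z_S)\big)

% MEKD emulation term (high-dimensional, in R^n, through the frozen generator f_G):
\mathcal{L}_{\mathrm{ME}} = d\big(f_G(\sigma_\tau(z_S)),\, f_G(\sigma_\tau(z_T))\big)

% By the chain rule, the gradient of the emulation term with respect to the
% student logits is pre-multiplied by the Jacobians of the softmax and of the
% generator, so its direction generally differs from direct logits alignment:
\nabla_{z_S}\mathcal{L}_{\mathrm{ME}}
  = J_{\sigma_\tau}(z_S)^{\top}\,
    J_{f_G}\big(\sigma_\tau(z_S)\big)^{\top}\,
    \nabla_{x_S} d(x_S, x_T)\Big|_{x_S = f_G(\sigma_\tau(z_S)),\; x_T = f_G(\sigma_\tau(z_T))}
```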
Stats
The teacher models achieve top-1 classification accuracy of 99.56% for ResNet56 and 99.52% for VGG13 on Syn.Digits.
The student MobileNet model achieves top-1 classification accuracy of 86.45% and 88.65% on SVHN when distilled from the ResNet56 and VGG13 teacher models, respectively.
Quotes
"Giving the student and teacher model fS and fT, for a data distribution μ ∈ X in image space which is mapped to PS ∈ Y and PT ∈ Y in latent space. If the Wasserstein distance between PS and PT equals zero, the student and teacher model are equivalent, i.e., fS = fT."
"Giving a prior distribution p ∈ RC, for a data distribution μ ∈ Rn, if the Wasserstein distance between generated distribution μ' = (fG)#p and μ equals zero, then the generator fG: RC → Rn is the inverse mapping of the teacher function fT: Rn → RC, denoted as fG = f −1
T."