The authors propose MINILLM, a knowledge distillation approach that minimizes the reverse KLD to distill large language models into smaller ones. Extensive experiments show that it outperforms standard KD methods across a variety of metrics.
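A minimal sketch of the token-level reverse KLD objective, KL(q_student || p_teacher), is shown below. The tensor shapes and helper name are illustrative assumptions, not the authors' code, and the actual MINILLM training procedure optimizes this objective with a policy-gradient-style algorithm rather than by direct differentiation alone.

```python
# Hedged sketch: token-level reverse KL between student and teacher logits.
import torch
import torch.nn.functional as F

def reverse_kld_loss(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor,
                     mask: torch.Tensor) -> torch.Tensor:
    """KL(q_student || p_teacher), averaged over non-padding tokens.

    student_logits, teacher_logits: (batch, seq_len, vocab)
    mask: (batch, seq_len), 1 for real tokens, 0 for padding.
    """
    log_q = F.log_softmax(student_logits, dim=-1)  # student log-probs
    log_p = F.log_softmax(teacher_logits, dim=-1)  # teacher log-probs
    # Reverse KL sums q * (log q - log p) over the vocabulary.
    token_kl = (log_q.exp() * (log_q - log_p)).sum(dim=-1)
    return (token_kl * mask).sum() / mask.sum()

# Usage (assumed variable names): the teacher side is detached so only the
# student receives gradients.
# loss = reverse_kld_loss(student_logits, teacher_logits.detach(), attention_mask)
```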
Neither forward Kullback-Leibler (FKL) divergence nor reverse Kullback-Leibler (RKL) divergence exhibits the expected mean-seeking or mode-seeking behavior in knowledge distillation for large language models. Instead, both FKL and RKL converge to the same optimization objective after a sufficient number of epochs. In practice, however, large language models are rarely trained for that many epochs. The authors therefore propose an Adaptive Kullback-Leibler (AKL) divergence that adaptively allocates weights to combine FKL and RKL, focusing on aligning the head and tail parts of the distributions.
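The sketch below illustrates one way such an adaptive combination could look: the per-token weight is derived from how much the student deviates from the teacher on the teacher's head (high-probability) tokens versus its tail. The head/tail split and the specific weighting rule are my assumptions for illustration; the exact definition of the adaptive weights is given in the AKL paper.

```python
# Hedged sketch of an adaptive FKL/RKL combination in the spirit of AKL.
import torch
import torch.nn.functional as F

def adaptive_kl_loss(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor,
                     head_mass: float = 0.5) -> torch.Tensor:
    log_q = F.log_softmax(student_logits, dim=-1)  # student
    log_p = F.log_softmax(teacher_logits, dim=-1)  # teacher
    p, q = log_p.exp(), log_q.exp()

    # Teacher "head": highest-probability tokens covering `head_mass` of the mass.
    sorted_p, idx = p.sort(dim=-1, descending=True)
    head_sorted = (sorted_p.cumsum(dim=-1) <= head_mass).float()
    head = torch.zeros_like(p).scatter(-1, idx, head_sorted).bool()

    # Discrepancy between teacher and student on head vs. tail tokens.
    gap = (p - q).abs()
    g_head = (gap * head).sum(dim=-1)
    g_tail = (gap * (~head)).sum(dim=-1)
    # Larger head gap -> weight FKL more (fit the head);
    # larger tail gap -> weight RKL more (fit the tail).
    mu = g_head / (g_head + g_tail + 1e-8)

    fkl = (p * (log_p - log_q)).sum(dim=-1)  # forward KL(p || q)
    rkl = (q * (log_q - log_p)).sum(dim=-1)  # reverse KL(q || p)
    return (mu * fkl + (1.0 - mu) * rkl).mean()
```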
The authors propose a new dual-space knowledge distillation (DSKD) framework that unifies the output spaces of the teacher and student models, increasing their similarity and enabling knowledge transfer even between models with different vocabularies.
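The core idea can be sketched as follows: map the student's hidden states into the teacher's representation space with a learned projector and apply the teacher's output head, so both distributions live in a single space. The projector and loss below are simplified assumptions for the same-vocabulary case; DSKD additionally introduces a cross-model attention mechanism to align models with different vocabularies.

```python
# Hedged sketch: compute the KD loss in the teacher's output space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentToTeacherProjector(nn.Module):
    """Learned linear map from student hidden size to teacher hidden size."""
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_hidden: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, student_dim) -> (batch, seq_len, teacher_dim)
        return self.proj(student_hidden)

def dual_space_kd_loss(student_hidden: torch.Tensor,
                       teacher_logits: torch.Tensor,
                       projector: nn.Module,
                       teacher_lm_head: nn.Module) -> torch.Tensor:
    """Forward KL between teacher and projected-student distributions,
    both expressed over the teacher's vocabulary."""
    projected = projector(student_hidden)          # student -> teacher space
    student_logits_t = teacher_lm_head(projected)  # logits over teacher vocab
    log_q = F.log_softmax(student_logits_t, dim=-1)
    p = F.softmax(teacher_logits, dim=-1)
    # Teacher side is detached so gradients flow only to the student/projector.
    return F.kl_div(log_q, p.detach(), reduction="batchmean")
```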