Core Concepts
Over-parameterizing the student model during knowledge distillation with Matrix Product Operator (MPO) tensor decomposition strengthens knowledge transfer from the larger teacher model and improves student performance, without increasing inference latency.
Zhan, Y.-L., Lu, Z.-Y., Sun, H., & Gao, Z.-F. (2024). Over-parameterized Student Model via Tensor Decomposition Boosted Knowledge Distillation. Advances in Neural Information Processing Systems, 37. arXiv:2411.06448 [cs.AI]
This paper proposes a method to make knowledge distillation more effective: the student model is over-parameterized during training via Matrix Product Operator (MPO) tensor decomposition. The added capacity helps bridge the gap between teacher and student and improves knowledge transfer, while inference latency stays unchanged because the decomposition is used only during training.
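A minimal PyTorch sketch of the idea, assuming a simple MPO factorization (the `MPOLinear` class, the factor and bond shapes, and the initialization below are illustrative choices, not the paper's exact construction): during training the weight is stored as a chain of tensor cores, which for a large bond dimension holds more trainable parameters than the dense matrix it represents; for inference the cores are contracted back into a single dense matrix, so latency matches a standard linear layer.

```python
import math
import torch
import torch.nn as nn

class MPOLinear(nn.Module):
    """Linear layer whose weight is over-parameterized as an MPO core chain.

    The dense (in_features x out_features) weight is represented by cores of
    shape (bond_k, in_factor_k, out_factor_k, bond_{k+1}). With a large bond
    dimension the chain carries more trainable parameters than the dense
    matrix. Shapes and init here are illustrative assumptions.
    """

    def __init__(self, in_factors, out_factors, bond_dim=64, bias=True):
        super().__init__()
        assert len(in_factors) == len(out_factors)
        self.in_features = math.prod(in_factors)
        self.out_features = math.prod(out_factors)
        bonds = [1] + [bond_dim] * (len(in_factors) - 1) + [1]
        self.cores = nn.ParameterList(
            nn.Parameter(0.02 * torch.randn(bonds[k], i, o, bonds[k + 1]))
            for k, (i, o) in enumerate(zip(in_factors, out_factors))
        )
        self.bias = nn.Parameter(torch.zeros(self.out_features)) if bias else None

    def contract_weight(self):
        """Contract the core chain into a dense (in_features, out_features) matrix."""
        w = self.cores[0]  # shape (1, i0, o0, b1)
        for core in self.cores[1:]:
            # Sum over the shared bond index, then merge input/output indices.
            w = torch.einsum('aiob,bjpc->aijopc', w, core)
            a, i, j, o, p, c = w.shape
            w = w.reshape(a, i * j, o * p, c)
        return w.reshape(self.in_features, self.out_features)

    def forward(self, x):
        out = x @ self.contract_weight()
        return out if self.bias is None else out + self.bias
```

After distillation, the cores can be folded into a plain `nn.Linear` once, so the deployed student costs exactly as much as an un-decomposed one:

```python
# 128 -> 256 layer: the MPO chain holds ~265k parameters vs. 32k dense,
# i.e. the student is over-parameterized only while it trains.
layer = MPOLinear(in_factors=[4, 8, 4], out_factors=[4, 8, 8], bond_dim=64)
dense = nn.Linear(layer.in_features, layer.out_features)
with torch.no_grad():
    dense.weight.copy_(layer.contract_weight().t())  # nn.Linear stores (out, in)
    dense.bias.copy_(layer.bias)
```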