The authors introduce the "Align-to-Distill" (A2D) strategy, which uses a trainable Attention Alignment Module to address the problem of mapping teacher features to student features in knowledge distillation for neural machine translation (NMT).
Instead of relying on a fixed, hand-designed mapping, A2D aligns student attention heads with teacher attention heads through this trainable module, and the resulting distillation improves student performance on NMT tasks.
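To make the mechanism concrete, below is a minimal sketch of a trainable attention-alignment distillation loss in PyTorch. It assumes attention maps are available as tensors of shape [batch, heads, target_len, source_len]; the module name `AttentionAlignmentModule`, the learned head-mixing via `nn.Linear`, and the KL-divergence objective are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of a trainable attention-alignment distillation loss (illustrative only).
# Assumption: alignment is a learned linear mixing of student heads toward teacher heads;
# the actual A2D formulation may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionAlignmentModule(nn.Module):
    """Maps student attention heads onto teacher heads with trainable weights."""

    def __init__(self, num_student_heads: int, num_teacher_heads: int):
        super().__init__()
        # One trainable weight per (teacher head, student head) pair:
        # each teacher head is approximated by a learned mix of student heads.
        self.mix = nn.Linear(num_student_heads, num_teacher_heads, bias=False)

    def forward(self, student_attn: torch.Tensor) -> torch.Tensor:
        # student_attn: [B, H_s, T, S] -> [B, T, S, H_s]
        x = student_attn.permute(0, 2, 3, 1)
        # Learned mixing over the head dimension: [B, T, S, H_t]
        x = self.mix(x)
        # Back to [B, H_t, T, S]
        return x.permute(0, 3, 1, 2)


def attention_alignment_loss(student_attn, teacher_attn, aligner, eps=1e-8):
    """KL divergence between aligned student attention and teacher attention."""
    aligned = aligner(student_attn)  # [B, H_t, T, S]
    # Re-normalize so each aligned row is a valid distribution over source positions.
    aligned = aligned.clamp_min(eps)
    aligned = aligned / aligned.sum(dim=-1, keepdim=True)
    teacher = teacher_attn.clamp_min(eps)
    return F.kl_div(aligned.log(), teacher, reduction="batchmean")


if __name__ == "__main__":
    B, H_s, H_t, T, S = 2, 4, 8, 5, 7
    student = torch.softmax(torch.randn(B, H_s, T, S), dim=-1)
    teacher = torch.softmax(torch.randn(B, H_t, T, S), dim=-1)
    aligner = AttentionAlignmentModule(H_s, H_t)
    loss = attention_alignment_loss(student, teacher, aligner)
    loss.backward()  # alignment weights are trained jointly with the student
    print(float(loss))
```

In this sketch the alignment weights are ordinary parameters, so they are updated together with the student during distillation; the hypothetical `attention_alignment_loss` would be added to the student's usual training objective with a weighting coefficient.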