Knowledge Distillation via Target-aware Transformer for Efficient Model Compression
The proposed target-aware transformer enables the student model to dynamically aggregate semantic information from the teacher model, allowing the student to mimic the teacher's representation as a whole rather than minimizing a set of per-location divergences under a rigid one-to-one spatial correspondence.
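To make the aggregation concrete, below is a minimal sketch of one plausible form of such a loss: each spatial position of the teacher feature map acts as a target that attends over all spatial positions of the student feature map, and the attention-weighted student reconstruction is aligned with the teacher. This is an illustrative assumption, not the paper's reference implementation; the function name `target_aware_distillation_loss` is hypothetical, and the sketch assumes the student features have already been projected (e.g., by a 1x1 convolution) to the teacher's channel dimension.

```python
import torch
import torch.nn.functional as F


def target_aware_distillation_loss(f_s: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of a target-aware distillation loss.

    f_s: student features, shape (B, C, H, W), assumed already projected to C channels.
    f_t: teacher features, shape (B, C, H, W).
    """
    B, C, H, W = f_t.shape
    s = f_s.flatten(2).transpose(1, 2)  # (B, N, C): student positions, N = H * W
    t = f_t.flatten(2).transpose(1, 2)  # (B, N, C): teacher targets

    # Each teacher target position attends over *all* student positions,
    # rather than being matched only to the student feature at the same location.
    attn = torch.softmax(t @ s.transpose(1, 2) / C ** 0.5, dim=-1)  # (B, N, N)

    # Aggregate the whole student feature map into one reconstruction per target.
    s_hat = attn @ s  # (B, N, C)

    # Align the aggregated student features with the teacher as a whole.
    return F.mse_loss(s_hat, t)


if __name__ == "__main__":
    f_s = torch.randn(2, 64, 8, 8)  # toy student features
    f_t = torch.randn(2, 64, 8, 8)  # toy teacher features
    print(target_aware_distillation_loss(f_s, f_t))
```

The softmax over student positions is what relaxes the one-to-one spatial matching: semantically relevant student features can contribute to any teacher target regardless of where they sit in the feature map.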