Asymmetric Masked Distillation for Efficient Pre-Training of Small Vision Transformer Models
The authors propose an asymmetric masked distillation (AMD) framework to pre-train relatively small vision transformer models efficiently. The key idea is to give the teacher model more context by applying a lower masking ratio to its input, while keeping the reconstruction task difficult for the student by masking its input more aggressively.
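The following is a minimal, hypothetical sketch of this asymmetric masking idea, not the authors' implementation: the teacher encoder sees a larger visible subset of patch tokens (lower masking ratio), the student sees a smaller subset drawn from the teacher's visible tokens (higher masking ratio), and a distillation loss aligns the student's features with the teacher's on the tokens both models observe. The encoder, dimensions, and masking ratios are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

def random_keep_mask(num_tokens: int, mask_ratio: float) -> torch.Tensor:
    """Return a boolean mask where True marks patch tokens kept visible."""
    num_keep = int(num_tokens * (1.0 - mask_ratio))
    perm = torch.randperm(num_tokens)
    keep = torch.zeros(num_tokens, dtype=torch.bool)
    keep[perm[:num_keep]] = True
    return keep

class TinyEncoder(nn.Module):
    """Stand-in for a ViT encoder; operates on visible tokens only."""
    def __init__(self, dim: int = 192, depth: int = 2, heads: int = 3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.blocks(tokens)

# Asymmetric masking step (ratios are illustrative, not the paper's settings).
dim, num_tokens = 192, 196                 # e.g. 14x14 patch tokens
tokens = torch.randn(1, num_tokens, dim)   # pre-embedded patch tokens

teacher, student = TinyEncoder(dim), TinyEncoder(dim)
teacher_keep = random_keep_mask(num_tokens, mask_ratio=0.5)   # teacher: lower masking ratio
# Student's visible set is a subset of the teacher's, at a higher effective masking ratio.
student_keep = teacher_keep & random_keep_mask(num_tokens, mask_ratio=0.75)

with torch.no_grad():
    teacher_feats = teacher(tokens[:, teacher_keep])
student_feats = student(tokens[:, student_keep])

# Align student features with teacher features on the tokens both models see.
shared = student_keep[teacher_keep]        # student-visible positions within the teacher's set
distill_loss = nn.functional.mse_loss(student_feats, teacher_feats[:, shared])
print(distill_loss.item())
```

Because the student's visible tokens form a subset of the teacher's, every student token has a corresponding teacher feature to distill from, while the teacher's richer context (more visible tokens) provides a stronger alignment target.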