The author argues that MoE models underperform on downstream tasks because their transfer capability is weaker than that of vanilla models. They propose transfer capability distillation, using vanilla models as teachers, to enhance MoE model performance.
Vanilla Transformers are effective teachers for model transfer capability.
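The distillation idea above can be sketched as a combined objective: the MoE student is trained on the task while its hidden representations are pulled toward those of a vanilla teacher. The exact alignment objective used in the paper is not given here, so the MSE feature-matching term, the `alpha` weighting, and all names below are illustrative assumptions, shown with NumPy arrays standing in for model outputs.

```python
import numpy as np

def distillation_loss(student_hidden, teacher_hidden,
                      student_logits, labels, alpha=0.5):
    """Hypothetical transfer-capability distillation objective.

    Combines the MoE student's task loss with a term that aligns its
    hidden states to a vanilla teacher's hidden states.
    """
    # Task loss: softmax cross-entropy on the student's logits.
    shifted = student_logits - student_logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    ce = -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

    # Transfer-capability term: pull the MoE student's hidden states
    # toward the vanilla teacher's (MSE here is an assumption; the
    # paper's actual alignment objective may differ).
    align = np.mean((student_hidden - teacher_hidden) ** 2)

    return alpha * ce + (1.0 - alpha) * align
```

In this sketch, perfectly aligned hidden states reduce the loss to the weighted task term alone, so the alignment term only penalizes the student when its representations drift from the teacher's.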