Comparing the Effectiveness of Knowledge Distillation and Pretraining from Scratch under a Fixed Computation Budget
Under a fixed computation budget, pretraining from scratch can be as effective as vanilla knowledge distillation, but more advanced distillation strategies such as TinyBERT and MiniLM still outperform it.
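To make "vanilla knowledge distillation" concrete, the sketch below shows a standard soft-label distillation objective: a temperature-scaled KL divergence between teacher and student logits combined with the usual hard-label loss. The temperature, mixing weight, and tensor shapes are illustrative assumptions, not settings taken from the paper.

# Minimal sketch of a vanilla knowledge-distillation loss (illustrative only;
# temperature, alpha, and shapes are assumptions, not the paper's settings).
import torch
import torch.nn.functional as F

def vanilla_kd_loss(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor,
                    labels: torch.Tensor,
                    temperature: float = 2.0,
                    alpha: float = 0.5) -> torch.Tensor:
    """Combine soft-label KL distillation with the standard hard-label loss."""
    # Soft targets: temperature-scaled KL divergence between teacher and student
    # distributions, rescaled by T^2 to keep gradient magnitudes comparable.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Example usage with random tensors (batch of 8, output vocabulary of 100).
student_logits = torch.randn(8, 100)
teacher_logits = torch.randn(8, 100)
labels = torch.randint(0, 100, (8,))
loss = vanilla_kd_loss(student_logits, teacher_logits, labels)

By contrast, strategies such as TinyBERT and MiniLM distill additional signals (e.g., intermediate representations or self-attention relations) rather than output distributions alone, which is what the summary refers to as more advanced distillation.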