The paper proposes Token Expansion (ToE), a novel token growth scheme for efficiently training Vision Transformers (ViTs). The key highlights are:
ToE introduces an "initialization-expansion-merging" pipeline to maintain the integrity of the intermediate feature distribution of the original Transformers, preventing the loss of crucial learnable information during the accelerated training process.
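To make the idea concrete, below is a minimal sketch of what a token-reduction step in the spirit of this pipeline could look like: a subset of tokens is kept (the kept set expands over training), and the remaining tokens are merged into their most similar kept tokens so the feature distribution is approximately preserved. The function name, the norm-based scoring rule, and the merging rule are illustrative assumptions, not the authors' implementation.

```python
import torch


def select_tokens(x, keep_ratio):
    """Illustrative ToE-style token reduction (hypothetical sketch).

    x: (B, N, C) patch tokens; returns (B, K, C) reduced tokens,
    where K = N * keep_ratio grows ("expands") over training stages.
    """
    B, N, C = x.shape
    K = max(1, int(N * keep_ratio))

    # Initialization / expansion: score tokens (here simply by feature norm,
    # an assumption for illustration) and keep the top-K.
    scores = x.norm(dim=-1)                           # (B, N)
    keep_idx = scores.topk(K, dim=-1).indices         # (B, K)
    kept = torch.gather(x, 1, keep_idx.unsqueeze(-1).expand(-1, -1, C))

    # Merging: rather than discarding the remaining tokens, fold each one
    # into its most similar kept token (by dot-product similarity) so the
    # intermediate feature distribution is roughly preserved.
    mask = torch.ones(B, N, dtype=torch.bool, device=x.device)
    mask.scatter_(1, keep_idx, False)
    dropped = x[mask].view(B, N - K, C)                       # (B, N-K, C)
    sim = torch.einsum('bmc,bkc->bmk', dropped, kept)         # (B, N-K, K)
    assign = sim.argmax(dim=-1)                               # (B, N-K)

    merged = kept.clone()
    merged.scatter_add_(1, assign.unsqueeze(-1).expand(-1, -1, C), dropped)
    counts = torch.ones(B, K, device=x.device, dtype=x.dtype)
    counts.scatter_add_(1, assign, torch.ones_like(assign, dtype=x.dtype))
    return merged / counts.unsqueeze(-1)                      # average merged tokens
```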
ToE can be seamlessly integrated into the training and fine-tuning of popular Transformers such as DeiT and LV-ViT without modifying the original training hyper-parameters, architecture, or strategies.
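A sketch of how such integration might look: the only change to a standard DeiT-style loop is a per-epoch token keep ratio that grows back to 1.0, while the optimizer, learning-rate schedule, and augmentation stay untouched. The stage boundaries and ratios below are assumed for illustration and are not taken from the paper.

```python
def token_keep_ratio(epoch, total_epochs, stages=(0.5, 0.75, 1.0)):
    """Hypothetical staged schedule: train on fewer tokens early,
    expand back to the full token set in later stages."""
    stage = min(int(len(stages) * epoch / total_epochs), len(stages) - 1)
    return stages[stage]


# Usage inside an otherwise unchanged training loop (sketch only;
# patch_embed / transformer / head stand in for the existing model):
#
# for epoch in range(total_epochs):
#     r = token_keep_ratio(epoch, total_epochs)
#     for images, targets in loader:
#         tokens = patch_embed(images)
#         tokens = select_tokens(tokens, keep_ratio=r)  # ToE-style reduction
#         loss = criterion(head(transformer(tokens)), targets)
#         loss.backward(); optimizer.step(); optimizer.zero_grad()
```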
Extensive experiments demonstrate that ToE trains ViTs about 1.3× faster in a lossless manner, and in some cases even improves performance over the full-token training baselines, outperforming previous state-of-the-art methods.
ToE can also be effectively combined with the efficient training framework EfficientTrain to further improve the training efficiency.
The transfer learning ability of ToE is evaluated by fine-tuning DeiT on CIFAR-10/100, showing that ToE pre-trained weights improve fine-tuning accuracy.
Ablation studies verify the effectiveness of the proposed "initialization-expansion-merging" pipeline and the robustness of ToE to different speedup factors.