Supervised fine-tuning after pretraining can effectively enhance the generalization of vision foundation models.
Masked Token Modeling (MTM) improves the storage efficiency of token-based vision model training by leveraging self-supervised pre-training on discrete tokens, while TokenAdapt and ColorAdapt make token-based data augmentation more effective.
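The corruption step underlying an MTM-style objective can be sketched as follows: a fraction of the discrete tokens is replaced with a mask id, and the model is trained to predict the original token at each masked position. This is a minimal illustrative sketch; the names `MASK_ID`, `mask_ratio`, and `mask_tokens` are hypothetical and not taken from the source.

```python
import random

MASK_ID = -1  # hypothetical sentinel id standing in for a [MASK] token


def mask_tokens(tokens, mask_ratio=0.5, seed=0):
    """Randomly replace a fraction of discrete tokens with MASK_ID.

    Returns the corrupted sequence and the sorted masked positions.
    An MTM objective would train a model to reconstruct the original
    token at each masked position from the surrounding context.
    """
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    positions = rng.sample(range(len(tokens)), n_mask)
    corrupted = list(tokens)
    for i in positions:
        corrupted[i] = MASK_ID
    return corrupted, sorted(positions)


# Toy token sequence, e.g. codes produced by a pretrained image tokenizer.
tokens = [17, 4, 99, 23, 8, 51]
corrupted, masked = mask_tokens(tokens, mask_ratio=0.5)
```

Only the masked positions carry the prediction loss; unmasked tokens pass through unchanged, which is what makes the objective self-supervised on the token data itself.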