Analyzing Local Representations of Self-supervised Vision Transformers: Comparative Study
Contrastive learning methods such as DINO produce more universal patch representations than masked image modeling, and these representations can be applied to downstream tasks without fine-tuning.