Key Concepts
Transformers learn feature-position correlations in masked image modeling.
Summary
1. Introduction
- Self-supervised learning is now the dominant paradigm for pretraining neural networks.
- Rise of masked image modeling (MIM) in vision pretraining.
- MIM focuses on reconstructing masked patches in images.
2. Problem Setup
- MIM framework for predicting masked patches (a toy sketch of the objective appears after this outline).
- Data distribution with spatial structures.
- Transformer architecture for MIM.
3. Attention Patterns and Feature-Position Correlations
- Significance of feature-position correlations.
- Comparison with existing theoretical studies.
4. Main Results
- Theoretical analysis of learning dynamics in transformers.
- Global convergence of loss function and attention patterns.
5. Overview of the Proof Techniques
- Gradient dynamics of attention correlations.
- Phases of learning feature-position (FP) correlations in different scenarios.
6. Experiments
- Introduction of attention diversity metric.
- Evaluation of attention patterns in self-attention mechanisms.
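The problem setup above can be made concrete with a small sketch. The following minimal Python example (not the paper's exact parameterization; dimensions, positional encodings, and initializations are illustrative assumptions) reconstructs masked patches as attention-weighted averages of unmasked patches, with queries built from the masked patch's positional encoding and keys combining position and feature, so that the trainable matrices W_Q and W_K can encode feature-position (FP) correlations.

```python
import numpy as np

rng = np.random.default_rng(1)
P, d = 16, 8                      # number of positions and feature dimension (assumed)

X = rng.normal(size=(P, d))       # patch features of one image
pos = np.eye(P)                   # one-hot positional encodings (illustrative choice)

masked = rng.choice(P, size=P // 2, replace=False)
unmasked = np.setdiff1d(np.arange(P), masked)

# Queries come from the positional encoding of each masked patch; keys/values
# come from the unmasked patches. The product of W_Q and W_K is where
# feature-position (FP) correlations can be stored in this toy setup.
W_Q = 0.1 * rng.normal(size=(P, d))
W_K = 0.1 * rng.normal(size=(P + d, d))   # keys see position + feature

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

keys = np.concatenate([pos[unmasked], X[unmasked]], axis=1) @ W_K   # (|U|, d)
queries = pos[masked] @ W_Q                                         # (|M|, d)
attn = softmax(queries @ keys.T)                                    # (|M|, |U|)

X_hat = attn @ X[unmasked]                  # reconstruct masked patches as
loss = np.mean((X_hat - X[masked]) ** 2)    # attention-weighted averages
print(loss)
```

The sketch only fixes the objective, not the training dynamics; in the paper's analysis, the loss reaches its global minimum as each masked query's attention concentrates on unmasked patches in the same area.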
Statistics
"For each cluster Dk, k ∈ [K], there is a corresponding partition of P into Nk disjoint subsets P = SNk j=1 Pk,j which we call areas."
"The distribution of zj(X) can be arbitrary within the above support set."
Quotes
"Transformers exhibit an area-wide pattern of attention, concentrating on unmasked patches within the same area."
"Understanding how the model trains and converges towards accurate image reconstruction can be achieved by examining how the attention mechanism evolves."