Core Concepts
The proposed Joint MLP/Attention (JoMA) dynamics provide a new mathematical framework for understanding the training procedure of multilayer Transformer architectures.
Abstract
JoMA introduces a novel mathematical framework to understand the training procedure of multilayer Transformer architectures.
It jointly models self-attention and MLP layers, explaining how attention first becomes sparse (focusing on salient tokens) and then dense (picking up less salient tokens) during training.
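One way to make the sparse-then-dense claim measurable is the entropy of each attention row. Below is a minimal sketch (an illustration of the metric, not the paper's code): low row entropy means attention concentrates on few tokens (sparse), high entropy means it spreads across many (dense).

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_entropy(attn):
    """Mean Shannon entropy of the rows of an attention matrix (each row a
    softmax distribution). Low entropy = sparse, high entropy = dense."""
    return float(-(attn * np.log(attn + 1e-12)).sum(axis=-1).mean())

# Toy illustration: sharpening the logits (lower temperature) lowers entropy,
# i.e. makes attention sparser.
logits = rng.normal(size=(8, 8))
for temperature in (1.0, 0.5, 0.1):
    p = np.exp(logits / temperature)
    p /= p.sum(axis=-1, keepdims=True)
    print(f"T={temperature}: mean row entropy = {attention_entropy(p):.3f}")
```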
Introduction:
Transformers are widely used across applications because of their strong empirical performance.
Understanding the learning mechanism of multilayer Transformers, however, remains a challenge.
Proposed Framework - JoMA:
JoMA removes unrealistic assumptions made in previous analyses (e.g., the absence of residual connections) and explains the dynamics of attention in multilayer Transformers.
It shows how tokens are combined hierarchically in the presence of nonlinear activations: lower layers mix tokens into local patterns that upper layers compose further, as in the toy sketch below.
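A toy two-layer sketch of that hierarchy, assuming random weights (this illustrates the idea only; it is not the paper's construction): attention mixes tokens, a nonlinear MLP turns the mixture into new features, and the next layer combines those features again.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attn_mlp_layer(x, w_qk, w_mlp):
    """One attention + nonlinear MLP block (no residual, for clarity)."""
    scores = x @ w_qk @ x.T                 # (seq, seq) attention logits
    mixed = softmax(scores) @ x             # tokens combined by attention
    return np.maximum(mixed @ w_mlp, 0.0)   # ReLU: nonlinear feature of the mix

d = 16
x = rng.normal(size=(10, d))                # 10 token embeddings
h1 = attn_mlp_layer(x,  rng.normal(size=(d, d)) / d, rng.normal(size=(d, d)) / d)
h2 = attn_mlp_layer(h1, rng.normal(size=(d, d)) / d, rng.normal(size=(d, d)) / d)
print(h2.shape)  # (10, 16): second-level features built from first-level mixes
```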
Related Work:
Previous studies have explored the expressiveness and training dynamics of attention-based models.
Training Dynamics:
Weight dynamics and attention sparsity evolve over the course of training.
Different learning rates produce different attention sparsity patterns (see the toy sketch below).
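A hedged toy experiment (not the paper's model): gradient descent on a row of attention logits toward a target token, run at two learning rates, tracking entropy. The only point is that the rate at which attention sparsifies depends on the learning rate.

```python
import numpy as np

def entropy(p):
    return float(-(p * np.log(p + 1e-12)).sum())

for lr in (0.1, 1.0):
    z = np.zeros(8)                            # attention logits over 8 tokens
    target = 3                                 # the token the loss favors
    for step in range(50):
        p = np.exp(z) / np.exp(z).sum()
        grad = p.copy()
        grad[target] -= 1.0                    # d(cross-entropy)/d(logits)
        z -= lr * grad
    p = np.exp(z) / np.exp(z).sum()
    # The larger learning rate yields much lower entropy (sparser attention)
    # after the same number of steps.
    print(f"lr={lr}: final entropy = {entropy(p):.3f}")
```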
Experiments:
Experiments validate the alignment between latent variables and hidden nodes in MLP layers.
Attention sparsity patterns are observed in real-world datasets and pre-trained models.
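A minimal sketch of how one might inspect such sparsity patterns in a pre-trained model, assuming the Hugging Face `transformers` package and the GPT-2 checkpoint are available (the model choice and sample sentence are ours for illustration, not the paper's setup):

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
inputs = tokenizer("Transformers combine tokens hierarchically.", return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions is a tuple of (batch, heads, seq, seq) tensors, one per layer;
# mean row entropy per layer serves as a crude density measure.
for i, attn in enumerate(out.attentions):
    ent = -(attn * (attn + 1e-12).log()).sum(-1).mean()
    print(f"layer {i}: mean attention entropy = {ent:.3f}")
```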
Discussion:
The discussion covers almost orthogonal embeddings, the training of embedding vectors, and computing self-attention directly from embeddings.
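A quick sketch of why the "almost orthogonal embeddings" assumption is plausible: independent random vectors in high dimension have pairwise cosine similarity close to zero by concentration of measure, and the effect strengthens with dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (16, 256, 4096):
    E = rng.normal(size=(100, d))                 # 100 random embeddings
    E /= np.linalg.norm(E, axis=1, keepdims=True) # unit-normalize rows
    cos = E @ E.T                                 # pairwise cosine similarities
    off_diag = cos[~np.eye(100, dtype=bool)]
    print(f"d={d}: max |cos| between distinct embeddings = "
          f"{np.abs(off_diag).max():.3f}")
```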
Conclusion:
JoMA provides insights into the joint dynamics of MLP and attention layers in multilayer Transformers, enhancing understanding of their training mechanisms.
Stats
JoMA retains residual connections and MLP nonlinearity as key ingredients of the analysis (see the sketch below).
Experiments on Wikitext2/Wikitext103 verify the theoretical findings.
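A minimal sketch, in a toy numpy setting (not the paper's exact parametrization), of the two ingredients the analysis keeps: a residual connection around attention and around a nonlinear MLP.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def transformer_block(x, w_qk, w_v, w1, w2):
    attn = softmax(x @ w_qk @ x.T)        # (seq, seq) attention weights
    x = x + attn @ x @ w_v                # residual connection around attention
    hidden = np.maximum(x @ w1, 0.0)      # MLP nonlinearity (ReLU)
    return x + hidden @ w2                # residual connection around the MLP

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(10, d))
y = transformer_block(x, *(rng.normal(size=(d, d)) / d for _ in range(4)))
print(y.shape)  # (10, 16)
```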