Core Concepts
A unified signal propagation theory allows transformer models to be trained deeper and more effectively by addressing vanishing/exploding gradients and rank collapse.
Abstract
The paper introduces a signal propagation theory for transformer models that enables training very deep models with improved performance. It addresses gradient instability and rank collapse by deriving the moments of transformer components and blocks. The proposed DeepScaleLM scheme allows models hundreds of layers deep to be trained across tasks and modalities, with significant improvements in language modeling, speech translation, image classification, and question answering.
Introduction:
Transformer models face challenges with gradient instability.
Proposed remedies include residual scaling and modified layernorms.
A theoretical analysis of signal propagation is crucial for understanding these issues; the variance-growth relation below illustrates the core idea.
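As a concrete illustration of the kind of moment analysis involved (a generic textbook-style relation, not the paper's exact derivation): if each residual block adds an approximately independent, zero-mean branch to the signal, the forward variance grows with depth unless the branch is scaled down.

```latex
% Generic variance-propagation sketch for a residual block x_{l+1} = x_l + f(x_l),
% assuming f(x_l) is approximately zero-mean and uncorrelated with x_l.
\sigma^2_{l+1} = \sigma^2_l + \sigma^2_f
\quad\Longrightarrow\quad
\sigma^2_L \approx \sigma^2_0 + L\,\sigma^2_f .
% Scaling the branch as x_{l+1} = x_l + \beta f(x_l) with \beta \propto 1/\sqrt{L}
% keeps \sigma^2_L bounded as the depth L grows.
```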
Moment Control & Residual Scaling:
Bounded gradients lead to better convergence.
Different scaling schemes are explored for residual connections (a minimal numerical sketch follows this list).
Learnable parameters impact model stability and performance.
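The sketch below shows the effect of residual scaling on forward variance; the toy linear blocks and the 1/sqrt(2L) factor are illustrative choices, not necessarily the paper's exact scheme. Without scaling, the signal variance roughly doubles per block; with a depth-dependent factor it stays bounded.

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_stack(depth, width=256, beta=1.0):
    """Push a random input through `depth` toy residual blocks
    x <- x + beta * f(x), where f is a fresh random linear map whose
    output variance matches its input variance, and return Var(x)."""
    x = rng.standard_normal(width)
    for _ in range(depth):
        w = rng.standard_normal((width, width)) / np.sqrt(width)
        x = x + beta * (w @ x)
    return x.var()

depth = 128
print("unscaled residual (beta=1):        ", residual_stack(depth))
print("depth-scaled residual (1/sqrt(2L)):", residual_stack(depth, beta=1 / np.sqrt(2 * depth)))
```

The unscaled stack ends with variance on the order of 2^depth, while the depth-scaled stack stays within a small constant factor of its initial value.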
Applications:
Explaining variance explosion in transformers.
Impact of large QK values on training stability (illustrated in the sketch following this list).
Mitigating rank collapse through proper initialization.
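The following self-contained sketch illustrates both points with toy numbers (dimensions and layer counts are arbitrary, not the paper's setup): large unscaled query-key logits saturate the softmax and shrink its Jacobian, and stacking attention-style mixing without residual connections drives token representations toward rank one.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# (a) Large QK logits: the softmax becomes near one-hot and its Jacobian
#     (diag(p) - p p^T) shrinks, one route to unstable or vanishing gradients.
d = 64
q = rng.standard_normal(d)
k = rng.standard_normal((16, d))                  # 16 keys
for scale in (1.0, 1.0 / np.sqrt(d)):             # raw vs. 1/sqrt(d)-scaled logits
    p = softmax(scale * (k @ q))
    jac = np.diag(p) - np.outer(p, p)
    print(f"logit scale {scale:.3f}: max prob {p.max():.3f}, "
          f"||softmax Jacobian|| {np.linalg.norm(jac):.4f}")

# (b) Rank collapse: repeated row-stochastic (attention-like) mixing without
#     residuals makes all token vectors converge to the same point.
tokens, width = 32, 64
x = rng.standard_normal((tokens, width))
for _ in range(32):
    mix = softmax(rng.standard_normal((tokens, tokens)))  # random attention weights
    x = mix @ x                                           # no residual connection
s = np.linalg.svd(x, compute_uv=False)
print("singular-value ratio s2/s1 after 32 layers:", s[1] / s[0])
```

Proper initialization and residual scaling, as studied in the paper, target exactly these two failure modes.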
Results:
DeepScaleLM enables effective training of deeper-narrower models (a rough parameter-count sketch follows this list).
Performance improvements observed across various tasks and modalities.
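As a back-of-the-envelope illustration of the deeper-narrower trade-off (the configurations below are hypothetical, not taken from the paper): per-layer transformer parameters scale roughly as 12·d², so depth can be traded for width at a near-constant parameter budget.

```python
def approx_transformer_params(layers: int, hidden: int) -> int:
    """Rough count: 4*d^2 for attention projections plus 8*d^2 for a
    4x-wide feed-forward block per layer; embeddings are ignored."""
    return layers * 12 * hidden * hidden

# Hypothetical shallow-wide vs. deep-narrow models with comparable budgets.
for layers, hidden in [(24, 1024), (96, 512), (192, 384)]:
    params = approx_transformer_params(layers, hidden)
    print(f"{layers:3d} layers x {hidden:4d} hidden -> {params / 1e6:6.1f}M params")
```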
Stats
"DeepNet stabilizes the model training at the expense of reduced “sensitivity” by using smaller effective values of β2."
"DSInit stabilizes the gradient but reduces model expressivity with depth."
"DSLM outperforms other methods by stabilizing training while maintaining model expressivity."
Quotes
"Our derived equations are empirically verified within strict error bounds with real world data."
"Our formulae predict observed norms with remarkable accuracy."