Basic concepts
Compared with traditional architectures such as MLPs and CNNs, the Transformer architecture, and in particular its self-attention mechanism, exhibits a distinctive and complex loss landscape: its Hessian is highly non-linear and heterogeneous across parameter blocks, with dependencies on the data, weight, and attention moments that differ from block to block.
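To make the asymmetry concrete, here is the standard single-head self-attention parameterization (generic notation, assumed rather than quoted from the paper): the query and key weights enter only through the softmax argument, whereas the value weights act linearly on the output, so their Hessian blocks inherit very different non-linearities.

```latex
% Standard single-head self-attention on a sequence X in R^{n x d}
% (generic notation; d_k is the key dimension).
\[
\mathrm{Attn}(X)
  = \underbrace{\mathrm{softmax}\!\left(\frac{X W_Q (X W_K)^\top}{\sqrt{d_k}}\right)}_{\text{non-linear in } W_Q,\, W_K}
    \, X W_V .
\]
% Second derivatives with respect to W_Q and W_K pass through the softmax
% and through products of X with itself, while W_V enters the output linearly;
% this mismatch is one source of the heterogeneous Hessian blocks noted above.
```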
Statistics
The query Hessian block entries are significantly smaller than those of the value block.
Removing the softmax from self-attention makes the magnitudes of the Hessian entries more homogeneous across blocks (probed in the sketch after these statistics).
Pre-LN mitigates the block heterogeneity in how the Hessian scales with the data.
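A minimal sketch of how one might probe the first two statistics empirically, assuming PyTorch and a toy single-head attention layer on random Gaussian data (the layer sizes, the MSE loss, and the weight scale are illustrative choices, not the paper's setup):

```python
# Compare the magnitudes of the query-query and value-value Hessian blocks
# of a toy single-head self-attention layer, with and without the softmax.
import torch
from torch.autograd.functional import hessian

torch.manual_seed(0)
n, d = 8, 4                      # sequence length, embedding dimension
X = torch.randn(n, d)            # toy input sequence
Y = torch.randn(n, d)            # toy regression target

def make_loss(use_softmax: bool):
    def loss(W_Q, W_K, W_V):
        scores = (X @ W_Q) @ (X @ W_K).T / d ** 0.5
        A = torch.softmax(scores, dim=-1) if use_softmax else scores
        out = A @ (X @ W_V)              # single-head self-attention output
        return ((out - Y) ** 2).mean()   # simple MSE loss (illustrative)
    return loss

W_Q, W_K, W_V = (0.1 * torch.randn(d, d) for _ in range(3))

for use_softmax in (True, False):
    # hessian(...) over a tuple of inputs returns a tuple of tuples of blocks;
    # H[0][0] is the query-query block, H[2][2] the value-value block.
    H = hessian(make_loss(use_softmax), (W_Q, W_K, W_V))
    q_block, v_block = H[0][0], H[2][2]
    print(f"softmax={use_softmax}: "
          f"mean |H_QQ| = {q_block.abs().mean().item():.2e}, "
          f"mean |H_VV| = {v_block.abs().mean().item():.2e}")
```

Switching use_softmax off drops the softmax non-linearity from the score matrix, so the query and value blocks see a more similar dependence on the data, which is the homogenization effect described above.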
Quotes
"Transformers are usually trained with adaptive optimizers like Adam(W) (Kingma & Ba, 2015; Loshchilov & Hutter, 2019) and require architectural extensions such as skip connections (He et al., 2016) and layer normalization (Xiong et al., 2020), learning rate warmu-p (Goyal et al., 2017), and using different weight initializations (Huang et al., 2020)."
"Our results suggest that various common architectural and optimization choices in Transformers can be traced back to their highly non-linear dependencies on the data and weight matrices, which vary heterogeneously across parameters."
"Ultimately, our findings provide a deeper understanding of the Transformer’s unique optimization landscape and the challenges it poses."