Key concepts
The paper investigates the optimization dynamics of single-layer Transformers, focusing on how the choice of attention kernel, Softmax versus Gaussian, affects the convergence of Gradient Descent (GD).
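To make the two kernels concrete, here is a minimal single-head sketch contrasting Softmax attention with a Gaussian attention kernel. The exact Gaussian form, its row normalization, and the bandwidth parameter sigma are illustrative assumptions, not definitions taken from the paper.

```python
import torch
import torch.nn.functional as F

def softmax_attention(Q, K, V):
    """Standard dot-product attention: weights = softmax(Q K^T / sqrt(D))."""
    D = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / D ** 0.5
    return F.softmax(scores, dim=-1) @ V

def gaussian_attention(Q, K, V, sigma=1.0):
    """Gaussian-kernel attention: weights proportional to exp(-||q - k||^2 / (2 sigma^2)).
    The kernel form, normalization, and bandwidth sigma are illustrative assumptions."""
    sq_dist = torch.cdist(Q, K) ** 2                        # pairwise squared distances
    weights = torch.exp(-sq_dist / (2 * sigma ** 2))
    weights = weights / weights.sum(dim=-1, keepdim=True)   # row-normalize like softmax
    return weights @ V

# Toy check: sequence length 8, embedding dimension D = 64, single head
Q, K, V = (torch.randn(8, 64) for _ in range(3))
print(softmax_attention(Q, K, V).shape)   # torch.Size([8, 64])
print(gaussian_attention(Q, K, V).shape)  # torch.Size([8, 64])
```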
Statistics
Embedding dimension D = 64
Hidden dimension d = 128
Number of attention heads H = 2
Text classification task: batch size of 16, learning rate of 1 × 10⁻⁴
Pathfinder task: batch size of 128, learning rate of 2 × 10⁻⁴ (see the sketch after this list)
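A minimal sketch, assuming a PyTorch setup, of how the hyperparameters above could be wired into a one-layer model and a plain gradient-descent optimizer. The SingleLayerTransformer class, the use of the hidden dimension as the feed-forward width, and the pooling/classification head are assumptions for illustration, not details from the paper.

```python
import torch
import torch.nn as nn

# Reported hyperparameters (variable names are illustrative).
EMBED_DIM = 64    # input embedding dimension D
HIDDEN_DIM = 128  # hidden dimension d (assumed here to be the feed-forward width)
NUM_HEADS = 2     # number of attention heads H

TASK_CONFIG = {
    "text_classification": {"batch_size": 16,  "lr": 1e-4},
    "pathfinder":          {"batch_size": 128, "lr": 2e-4},
}

class SingleLayerTransformer(nn.Module):
    """Hypothetical one-layer model: softmax self-attention plus a feed-forward head."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(EMBED_DIM, NUM_HEADS, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(EMBED_DIM, HIDDEN_DIM),
            nn.ReLU(),
            nn.Linear(HIDDEN_DIM, num_classes),
        )

    def forward(self, x):                 # x: (batch, seq_len, EMBED_DIM)
        h, _ = self.attn(x, x, x)
        return self.head(h.mean(dim=1))   # mean-pool over the sequence, then classify

cfg = TASK_CONFIG["text_classification"]
model = SingleLayerTransformer()
# Plain SGD without momentum as the closest practical analogue of the GD setting studied.
optimizer = torch.optim.SGD(model.parameters(), lr=cfg["lr"])
```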
Quotes
"Our findings demonstrate that, with appropriate weight initialization, GD can train a Transformer model (with either kernel type) to achieve a global optimal solution, especially when the input embedding dimension is large."
"Nonetheless, certain scenarios highlight potential pitfalls: training a Transformer using the Softmax attention kernel may sometimes lead to suboptimal local solutions. In contrast, the Gaussian attention kernel exhibits a much favorable behavior."