This research paper investigates the optimization dynamics of single-layer Transformers, specifically focusing on the impact of Softmax and Gaussian attention kernels on Gradient Descent (GD) convergence.
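To make the two kernels concrete, here is a minimal NumPy sketch of single-layer attention with a swappable kernel. The function name `attention` and the exact Gaussian form exp(-||q - k||^2 / 2) are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def attention(Q, K, V, kernel="softmax"):
    """Single-layer attention with a swappable kernel (illustrative sketch)."""
    d = Q.shape[-1]
    if kernel == "softmax":
        scores = Q @ K.T / np.sqrt(d)                       # scaled dot-product
    elif kernel == "gaussian":
        # squared Euclidean distances between queries and keys
        sq_dists = ((Q[:, None, :] - K[None, :, :]) ** 2).sum(-1)
        scores = -sq_dists / 2.0
    else:
        raise ValueError(kernel)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)               # row-normalize
    return weights @ V

rng = np.random.default_rng(0)
Q = K = V = rng.standard_normal((8, 16))
out_softmax = attention(Q, K, V, "softmax")
out_gaussian = attention(Q, K, V, "gaussian")
```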
This study proves that, under specific conditions, large Transformer models trained with gradient flow achieve global convergence as model width and depth tend to infinity, shedding light on the theoretical foundations of Transformer training.
This paper shows that, in the limit where width and depth go to infinity, gradient descent converges to a global minimum when training large-scale Transformers with weight-decay regularization.
This paper provides theoretical guarantees for the global convergence of gradient flow when training large-scale Transformer models, by analyzing their mean-field limit and showing that it approximates a Wasserstein gradient flow.
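For orientation, the textbook continuity-equation form of a 2-Wasserstein gradient flow over a parameter distribution is sketched below; the risk functional F shown is a generic choice for illustration, not necessarily the paper's exact objective.

```latex
% Generic continuity-equation form of a 2-Wasserstein gradient flow of a
% risk functional F over the parameter distribution \mu_t (for orientation):
\[
  \partial_t \mu_t \;=\; \nabla_\theta \!\cdot\! \Big( \mu_t \,\nabla_\theta \frac{\delta F}{\delta \mu}(\mu_t) \Big),
  \qquad
  F(\mu) \;=\; \mathbb{E}_{(x,y)}\!\left[ \ell\big(f_\mu(x),\, y\big) \right].
\]
```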
RecurFormer improves the efficiency of Transformer-based LLMs by replacing self-attention heads that exhibit a recency-aware attention pattern with Mamba, a linear recurrent architecture, reducing cache size and speeding up inference while maintaining comparable generation quality.
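One way to make "recency-aware" concrete is to measure how much attention mass each query places on its most recent positions. The diagnostic below (the `recency_score` name, window size, and thresholding idea) is a hypothetical sketch, not RecurFormer's actual head-selection criterion.

```python
import numpy as np

def recency_score(attn_weights, window=8):
    """Average fraction of attention mass each query places on its
    `window` most recent positions (including itself). Hypothetical metric."""
    T = attn_weights.shape[-1]
    scores = []
    for t in range(T):
        lo = max(0, t - window + 1)
        scores.append(attn_weights[t, lo:t + 1].sum())
    return float(np.mean(scores))

# Heads whose score exceeds some threshold would be candidates for
# replacement with a linear recurrent block (e.g. Mamba).
rng = np.random.default_rng(0)
A = np.tril(rng.random((32, 32)))
A /= A.sum(-1, keepdims=True)          # causal, row-normalized attention
print(recency_score(A, window=8))
```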
Compared with conventional neural network architectures, the Transformer's loss Hessian has a fundamentally different structure: it depends strongly on the data, the weights, and the attention moments, and is highly non-linear, which makes optimization difficult.
The Transformer architecture, particularly its self-attention mechanism, exhibits a unique and complex loss landscape compared to traditional architectures like MLPs and CNNs, characterized by a highly non-linear and heterogeneous Hessian matrix with varying dependencies on data, weight, and attention moments.
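The dependence of the Hessian on data, weights, and attention can be inspected directly on a toy model. The sketch below builds a tiny single-head attention loss in the query/key weights and extracts its block Hessian with `torch.autograd.functional.hessian`; the shapes and loss are illustrative assumptions, not the paper's analysis.

```python
import torch
from torch.autograd.functional import hessian

torch.manual_seed(0)
T, d = 4, 3
X = torch.randn(T, d)          # token embeddings
y = torch.randn(T, d)          # regression targets

def loss(Wq_flat, Wk_flat):
    """Squared-error loss of a single softmax-attention layer in (Wq, Wk)."""
    Wq, Wk = Wq_flat.view(d, d), Wk_flat.view(d, d)
    scores = (X @ Wq) @ (X @ Wk).T / d ** 0.5
    out = torch.softmax(scores, dim=-1) @ X
    return ((out - y) ** 2).mean()

Wq0, Wk0 = torch.randn(d * d), torch.randn(d * d)
# 2x2 grid of Hessian blocks: (Wq,Wq), (Wq,Wk), (Wk,Wq), (Wk,Wk)
H = hessian(loss, (Wq0, Wk0))
print([[blk.shape for blk in row] for row in H])
```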
Memory-augmented Transformers (Memformers) can effectively learn and implement sophisticated optimization algorithms, such as Conjugate Gradient Descent and other linear first-order methods, potentially leading to more efficient and generalizable optimization techniques.
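For context, the classical Conjugate Gradient method that the Memformer is reported to approximate is sketched below for a quadratic objective; this is the textbook algorithm, not the learned optimizer itself.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-8, max_iter=None):
    """Classical CG for solving A x = b with A symmetric positive definite,
    i.e. minimizing f(x) = 0.5 x^T A x - b^T x."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x                        # residual = negative gradient of f
    p = r.copy()                         # initial search direction
    max_iter = max_iter or n
    for _ in range(max_iter):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)       # exact line search along p
        x = x + alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r) # coefficient for the next direction
        p = r_new + beta * p             # new A-conjugate direction
        r = r_new
    return x

rng = np.random.default_rng(0)
M = rng.standard_normal((10, 10))
A = M @ M.T + 10 * np.eye(10)            # SPD matrix
b = rng.standard_normal(10)
x = conjugate_gradient(A, b)
print(np.linalg.norm(A @ x - b))         # residual should be near zero
```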
While SignGD (and, by extension, Adam) can train two-layer Transformers on noisy data with fast convergence, the resulting models generalize poorly because they memorize noise rather than learning meaningful features.
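The update rule at issue is simple: SignGD moves each weight by a fixed step in the sign of its gradient, which is why it serves as a proxy for Adam. A minimal sketch, shown next to plain GD for comparison:

```python
import numpy as np

def signgd_step(w, grad, lr=1e-3):
    """One SignGD update: step by lr in the sign of the gradient."""
    return w - lr * np.sign(grad)

def gd_step(w, grad, lr=1e-3):
    """Plain gradient descent step, for comparison."""
    return w - lr * grad

# Toy usage on f(w) = 0.5 * ||w||^2, whose gradient at w is simply w.
w_sign = w_gd = np.ones(4)
for _ in range(100):
    w_sign = signgd_step(w_sign, w_sign, lr=0.01)
    w_gd = gd_step(w_gd, w_gd, lr=0.01)
```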
Precomputing the first layer of Transformers that use RoPE can lower latency and cost per token, speeding up inference.
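The precomputation rests on the fact that, with RoPE, no absolute positional embedding is added to the input, so the first layer's Q/K/V projections depend only on the token id and can be computed once per vocabulary entry. The sketch below uses assumed shapes and names, and the half-split `rope` helper is one common RoPE variant, not necessarily the paper's implementation.

```python
import numpy as np

vocab, d = 1000, 64                       # toy sizes; in practice the full vocabulary
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab, d))       # embedding table
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

# One-time precomputation: a first-layer lookup table per projection.
Q_table, K_table, V_table = E @ Wq, E @ Wk, E @ Wv

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to one vector at position `pos`."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    angles = pos * freqs
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(angles) - x2 * np.sin(angles),
                           x1 * np.sin(angles) + x2 * np.cos(angles)])

# At inference time, first-layer Q/K become a table lookup plus a rotation;
# V needs no rotation at all.
token_id, pos = 123, 17
q = rope(Q_table[token_id], pos)
k = rope(K_table[token_id], pos)
v = V_table[token_id]
```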