The paper focuses on the convergence analysis of two popular adaptive optimizers, RMSProp and Adam, under the most relaxed assumptions of coordinate-wise generalized (L0, L1)-smoothness and affine noise variance.
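For reference, these assumptions typically take the following form (a standard formulation; the paper's exact constants and conditions may differ). Coordinate-wise generalized (L0, L1)-smoothness requires, for each coordinate i,

$$ |\nabla_i f(x) - \nabla_i f(y)| \le \big(L_{0,i} + L_{1,i}\,|\nabla_i f(x)|\big)\,\|x - y\|, $$

and affine noise variance bounds the stochastic gradient g_t by

$$ \mathbb{E}\big[\|g_t - \nabla f(x_t)\|^2\big] \le \sigma_0^2 + \sigma_1^2\,\|\nabla f(x_t)\|^2, $$

so both the effective smoothness constant and the noise level are allowed to grow with the gradient magnitude rather than being uniformly bounded.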
Key highlights:
For RMSProp, the authors address several major challenges: the dependence between the adaptive stepsize and the gradient, potentially unbounded gradients, and additional error terms arising from (L0, L1)-smoothness. They develop novel techniques to bound these terms and show that RMSProp with properly chosen hyperparameters converges to an ε-stationary point with an iteration complexity of O(ε^-4), matching the known lower bound.
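For context, the per-coordinate RMSProp update analyzed in this line of work is typically of the form (the hyperparameter names here are illustrative):

$$ v_t = \beta\, v_{t-1} + (1-\beta)\, g_t^2, \qquad x_{t+1} = x_t - \eta\,\frac{g_t}{\sqrt{v_t} + \xi}, $$

where the effective stepsize \eta/(\sqrt{v_t} + \xi) depends on the current gradient g_t through v_t, which is precisely the stepsize-gradient dependence the analysis has to handle.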
For Adam, the authors face an additional challenge due to the mismatch between the gradient and the first-order momentum. They develop a new upper bound on the first-order term in the descent lemma and show that Adam with properly chosen hyperparameters also converges to an ε-stationary point with an iteration complexity of O(ε^-4).
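For comparison, a standard Adam update (bias correction omitted for brevity; notation is illustrative) reads

$$ m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2, \qquad x_{t+1} = x_t - \eta\,\frac{m_t}{\sqrt{v_t} + \xi}, $$

so the update direction m_t is a moving average of past gradients rather than the current gradient, which is the source of the gradient-momentum mismatch mentioned above.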
The results improve upon prior work by considering the more practical and challenging coordinate-wise (L0, L1)-smooth objectives together with the refined affine noise variance assumption, which better capture the conditions encountered when training neural networks.
The authors' analyses are comprehensive, providing detailed technical insight into each of the key challenges, and the final convergence results match the optimal lower bound, showing that the derived rates are tight.