The ADOPT algorithm overcomes the convergence limitations of Adam and its variants in smooth nonconvex optimization by decorrelating the second-moment estimate from the current gradient and by swapping the order of the momentum update and the normalization. This lets it achieve the optimal convergence rate without relying on problem-specific hyperparameter choices (such as a particular β2) or bounded-noise assumptions.
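A minimal NumPy sketch of that reordering, assuming a per-parameter update where `v` is initialized to the first gradient squared and `m` to zeros; the function name and hyperparameter defaults are illustrative, not the authors' reference implementation:

```python
import numpy as np

def adopt_step(theta, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.9999, eps=1e-6):
    """One ADOPT-style update (sketch).

    Unlike Adam, the gradient is normalized by the *previous* second-moment
    estimate v (decorrelation), and that normalization happens *before* the
    momentum update (order change).
    """
    # Normalize with a second moment that excludes the current gradient.
    normed = grad / np.maximum(np.sqrt(v), eps)
    # Momentum is accumulated on the already-normalized gradient.
    m = beta1 * m + (1.0 - beta1) * normed
    # The parameter step uses the momentum directly, with no further scaling.
    theta = theta - lr * m
    # Only now is the second moment updated with the current gradient.
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    return theta, m, v
```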
This paper derives and empirically validates square-root scaling rules for the adaptive gradient algorithms RMSprop and Adam: using stochastic differential equations (SDEs) to model the algorithms' dynamics, it prescribes how the learning rate and related hyperparameters should be adjusted when the batch size changes, and analyzes the resulting effect on performance.
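A small helper sketching one common statement of the square-root rule; the exact adjustments to the β decays and ε should be taken from the paper itself, so the factors below are illustrative assumptions:

```python
import math

def sqrt_scale_adam_hparams(lr, beta1, beta2, eps, batch_size, base_batch_size):
    """Square-root scaling sketch: adjust Adam/RMSprop hyperparameters when the
    batch size changes by a factor kappa relative to a tuned baseline.

    Assumed form of the rule (consult the paper for the exact statement):
      lr       -> sqrt(kappa) * lr
      1 - beta -> kappa * (1 - beta)   for both moment decays
      eps      -> eps / sqrt(kappa)
    """
    kappa = batch_size / base_batch_size
    return {
        "lr": math.sqrt(kappa) * lr,
        "beta1": 1.0 - kappa * (1.0 - beta1),
        "beta2": 1.0 - kappa * (1.0 - beta2),
        "eps": eps / math.sqrt(kappa),
    }

# Example: a recipe tuned at batch size 256, scaled up to 1024 (kappa = 4).
print(sqrt_scale_adam_hparams(lr=3e-4, beta1=0.9, beta2=0.999, eps=1e-8,
                              batch_size=1024, base_batch_size=256))
```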
AdEMAMix, a novel optimizer, can leverage very old gradients to reach better solutions faster than the widely used Adam optimizer. It achieves this by mixing a fast-changing and a slow-changing exponential moving average of the gradients.
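A hedged sketch of such a two-EMA update, written as an AdamW-style step; the defaults for `beta3` and `alpha` are illustrative, and the paper's warm-up schedules for those two quantities are omitted:

```python
import numpy as np

def ademamix_step(theta, grad, m_fast, m_slow, v, t,
                  lr=1e-3, beta1=0.9, beta2=0.999, beta3=0.9999,
                  alpha=5.0, eps=1e-8, weight_decay=0.0):
    """One AdEMAMix-style step (sketch): two gradient EMAs are mixed.

    m_fast is the usual Adam momentum; m_slow is a much slower EMA (beta3
    close to 1) that retains very old gradients; alpha weights its
    contribution to the update.
    """
    m_fast = beta1 * m_fast + (1.0 - beta1) * grad
    m_slow = beta3 * m_slow + (1.0 - beta3) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Bias-correct the fast EMA and the second moment, as in Adam.
    m_hat = m_fast / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    update = (m_hat + alpha * m_slow) / (np.sqrt(v_hat) + eps)
    theta = theta - lr * (update + weight_decay * theta)
    return theta, m_fast, m_slow, v
```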
Removing the square root from adaptive gradient methods can close their generalization gap to SGD on convolutional architectures while maintaining performance on transformers, highlighting the overlooked role of adaptivity in their success.
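A schematic of where the square root is dropped, written as an Adam-like step preconditioned by `v` itself; this deliberately simplifies the paper's actual root-free methods, which also rescale the step and motivate `v` as a second-order preconditioner:

```python
import numpy as np

def rootfree_adaptive_step(theta, grad, m, v, t,
                           lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Schematic square-root-free update: divide by v itself, not sqrt(v).

    Only meant to show where the square root disappears relative to Adam;
    it is not the paper's full root-free algorithm.
    """
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (v_hat + eps)  # no np.sqrt around v_hat
    return theta, m, v
```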