Removing the square root from adaptive gradient methods can close the generalization gap to SGD on convolutional architectures while maintaining performance on transformers, highlighting the overlooked role of adaptivity in their success.
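A minimal sketch of the idea, not the paper's exact method: an Adam-style update where the square root over the second-moment EMA can be toggled off. The function name, default hyperparameters, and the simple `v + eps` denominator are assumptions for illustration; the square-root-free variant also changes the effective scale of the step, so the learning rate generally needs retuning.

```python
import numpy as np

def adaptive_update(param, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999,
                    eps=1e-8, use_sqrt=True):
    """One Adam-like step; set use_sqrt=False for a square-root-free
    variant (illustrative only, lr typically needs rescaling)."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment EMA
    denom = (np.sqrt(v) + eps) if use_sqrt else (v + eps)
    param = param - lr * m / denom
    return param, m, v
```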
AdEMAMix, a novel optimizer, can leverage very old gradients to reach better solutions faster than the widely used Adam optimizer. It achieves this by combining a fast-changing and a slow-changing exponential moving average of the gradients.
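A sketch of an AdEMAMix-style step under the assumption that the update numerator mixes the bias-corrected fast EMA with `alpha` times the slow EMA; the hyperparameter values and the omission of the paper's warmup schedules for `beta3` and `alpha` are simplifications, not the reference implementation.

```python
import numpy as np

def ademamix_step(param, grad, m1, m2, v, t, lr=1e-3, beta1=0.9,
                  beta2=0.999, beta3=0.9999, alpha=5.0, eps=1e-8):
    """One AdEMAMix-style step: mix a fast gradient EMA (beta1) with a
    slow, long-memory EMA (beta3); schedules are omitted for brevity."""
    m1 = beta1 * m1 + (1 - beta1) * grad      # fast-changing EMA
    m2 = beta3 * m2 + (1 - beta3) * grad      # slow-changing EMA (old gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment EMA, as in Adam
    m1_hat = m1 / (1 - beta1 ** t)            # Adam-style bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * (m1_hat + alpha * m2) / (np.sqrt(v_hat) + eps)
    return param, m1, m2, v
```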