Core Concepts
Momentum does not provide significant optimization or generalization benefits in stochastic gradient descent when the learning rate is small, as SGD and SGD with Momentum (SGDM) exhibit similar behaviors in both short-term and long-term training.
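For reference, these are the two update rules being compared, written in the standard heavy-ball form (the notation here is conventional, not necessarily the paper's):

```latex
\begin{aligned}
\text{SGD:}  \quad & x_{t+1} = x_t - \eta\, g_t(x_t) \\
\text{SGDM:} \quad & v_{t+1} = \beta\, v_t + g_t(x_t), \qquad x_{t+1} = x_t - \eta\, v_{t+1}
\end{aligned}
```

Here g_t is a stochastic gradient, η the learning rate, and β the momentum parameter. A standard observation is that once the velocity equilibrates, SGDM moves like SGD with effective step size η/(1 − β), which is why the two are usually compared at matched effective learning rates.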
Abstract
The paper investigates the role of momentum in stochastic gradient descent (SGD) optimization, particularly in settings where the learning rate is small. The key findings are:
Theoretical Analysis:
In the short term (O(1/η) steps), SGDM is shown to weakly approximate SGD, meaning the distributions of their iterates stay close. This holds even when the gradient noise scales inversely with the learning rate. (A toy simulation of this is sketched after this list.)
In the long term (O(1/η^2) steps), SGD and SGDM are proven to have the same limiting dynamics when the iterates stay close to a manifold of local minimizers. This suggests momentum provides no extra generalization benefit over SGD in this regime.
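A minimal simulation illustrates the short-term claim (an assumed toy setup on a noisy one-dimensional quadratic, not the paper's experiments): with a small learning rate, SGD and SGDM yield nearly indistinguishable iterate distributions over an O(1/η) horizon.

```python
# Toy check of the short-term claim: with a small learning rate, SGD and
# SGDM produce nearly identical iterate distributions on a noisy quadratic.
# Illustrative sketch only, not the paper's experimental setup.
import numpy as np

rng = np.random.default_rng(0)
eta, beta = 1e-3, 0.9          # small learning rate; standard momentum
steps = int(1 / eta)           # O(1/eta) horizon from the theory
n_runs = 2000                  # many trajectories, to compare distributions

def run(momentum: bool) -> np.ndarray:
    """Minimize f(x) = x^2 / 2 with gradient noise; return final iterates."""
    x = np.ones(n_runs)
    v = np.zeros(n_runs)
    # Assumed matching convention: SGD uses the rescaled step eta / (1 - beta),
    # the effective step size of heavy-ball momentum once the velocity settles.
    lr = eta if momentum else eta / (1 - beta)
    for _ in range(steps):
        g = x + rng.standard_normal(n_runs)   # grad f(x) = x, plus unit noise
        if momentum:
            v = beta * v + g
            x = x - lr * v
        else:
            x = x - lr * g
    return x

sgd, sgdm = run(momentum=False), run(momentum=True)
print(f"SGD : mean={sgd.mean():+.4f}, std={sgd.std():.4f}")
print(f"SGDM: mean={sgdm.mean():+.4f}, std={sgdm.std():.4f}")
```

Both runs should report means near 0 with nearly identical standard deviations, consistent with the weak-approximation picture.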
Empirical Verification:
Experiments on ImageNet training of ResNet-50 and fine-tuning of RoBERTa-large confirm that SGD and SGDM perform comparably when the optimal learning rate is small.
For large-batch training on CIFAR-10, the authors show that the benefit of momentum is attributable to its ability to mitigate curvature-induced effects rather than to noise reduction: once the curvature-induced impact is reduced, the performance gap between SGD and SGDM diminishes (see the sketch below).
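The curvature effect can be seen in isolation with a deterministic toy (an assumed illustration, not the paper's CIFAR-10 experiment): on an ill-conditioned quadratic, heavy-ball momentum stays stable at a step size where plain gradient descent oscillates and diverges along the sharp direction.

```python
# Sketch of curvature mitigation: heavy-ball momentum converges at a step
# size beyond plain GD's stability limit for the sharp direction. Gradients
# are deterministic here, to isolate curvature from noise.
import numpy as np

curvatures = np.array([100.0, 1.0])   # eigenvalues of a diagonal Hessian
eta, beta, steps = 0.03, 0.9, 200     # eta > 2/100 (GD unstable on the sharp
                                      # mode) but < 2(1+beta)/100 (heavy-ball ok)

def run(momentum: bool) -> np.ndarray:
    x = np.ones(2)                    # one sharp and one flat coordinate
    v = np.zeros(2)
    for _ in range(steps):
        g = curvatures * x            # gradient of f(x) = sum(c_i * x_i^2) / 2
        v = beta * v + g if momentum else g
        x = x - eta * v
    return x

print("GD final iterate        :", run(momentum=False))  # sharp mode blows up
print("Heavy-ball final iterate:", run(momentum=True))   # both modes converge
```

The underlying fact is standard for quadratics: plain GD is stable only for η < 2/λmax, while heavy-ball extends the limit to η < 2(1 + β)/λmax, so momentum tolerates sharp curvature at step sizes where GD cannot.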
The paper concludes that momentum provides limited benefits in practical training regimes where the optimal learning rate is not very large, and that model performance is generally insensitive to the choice of momentum hyperparameters in such settings.
Stats
No specific numerical data or metrics are reported to quantify the claims. The analysis is primarily theoretical, and the empirical results are presented qualitatively.