The paper investigates the behavior of Sharpness-Aware Minimization (SAM), a variant of gradient descent (GD) that has been shown to improve generalization in neural network training. The authors first derive an "edge of stability" for SAM, analogous to the 2/η value identified for GD, which depends on the norm of the gradient in addition to the step size η and the SAM radius ρ.
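To make that dependence concrete, here is a minimal illustrative derivation on a one-dimensional quadratic loss (an expository assumption, not the paper's exact statement or proof):

```latex
% Illustrative only: a 1-D quadratic loss, not the paper's exact derivation.
% Take \ell(w) = \tfrac{\lambda}{2} w^2, so \nabla\ell(w) = \lambda w and the
% Hessian is the curvature \lambda > 0. The SAM update is then
\begin{align*}
  w_{t+1}
    &= w_t - \eta\, \nabla\ell\!\left( w_t + \rho\, \frac{\nabla\ell(w_t)}{\lVert \nabla\ell(w_t) \rVert} \right)
     = w_t \left( 1 - \eta\lambda \left( 1 + \frac{\rho\lambda}{\lVert \nabla\ell(w_t) \rVert} \right) \right),
\end{align*}
% so the iterates stop growing exactly when
\begin{align*}
  \eta\lambda \left( 1 + \frac{\rho\lambda}{\lVert \nabla\ell(w_t) \rVert} \right) \le 2 .
\end{align*}
% At \rho = 0 this recovers the GD condition \lambda \le 2/\eta; for \rho > 0 the
% admissible curvature is strictly smaller, and the threshold moves with the
% gradient norm, matching the qualitative dependence described above.
```

In this toy setting the threshold tightens as the gradient norm shrinks, which is the sense in which the SAM edge of stability, unlike 2/η for GD, changes over the course of training.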
The authors then conduct experiments on three deep learning tasks: training a fully connected network on MNIST, a convolutional network on CIFAR-10, and a Transformer language model on tiny_shakespeare. In all cases, they observe that the operator norm of the Hessian closely tracks the SAM edge of stability derived in the analysis, even when stochastic gradients are used. This contrasts with GD, where the Hessian norm instead rises to and then fluctuates around 2/η.
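To illustrate what tracking this quantity involves, here is a minimal sketch (not the authors' code; the model, data, and hyperparameters are placeholders) that estimates the operator norm of the Hessian of a training loss via power iteration on Hessian-vector products in PyTorch:

```python
# Minimal sketch: estimate the top Hessian eigenvalue/eigenvector of a loss
# via power iteration on Hessian-vector products (one extra backward per step).
import torch
import torch.nn as nn

def hessian_top_eigenpair(loss, params, iters=20):
    """Power iteration on Hessian-vector products of `loss` w.r.t. `params`."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    v = torch.randn_like(flat_grad)
    v /= v.norm()
    for _ in range(iters):
        # Hessian-vector product: differentiate (grad . v) w.r.t. the parameters.
        hv = torch.autograd.grad(flat_grad @ v, params, retain_graph=True)
        hv = torch.cat([h.reshape(-1) for h in hv])
        eigval = v @ hv            # Rayleigh quotient with the normalized v
        v = hv / (hv.norm() + 1e-12)
    return eigval.item(), v

# Toy usage: a small fully connected network on random data (placeholders).
model = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 1))
x, y = torch.randn(64, 10), torch.randn(64, 1)
params = [p for p in model.parameters() if p.requires_grad]
loss = nn.functional.mse_loss(model(x), y)
lam_max, _ = hessian_top_eigenpair(loss, params)
eta, rho = 0.01, 0.05  # illustrative step size and SAM radius
print(f"||H||_op ≈ {lam_max:.3f}, GD edge of stability 2/eta = {2/eta:.1f}")
```

Logging such an estimate alongside the SAM threshold (which additionally involves ρ and the gradient norm) is the kind of comparison the paper's experiments report.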
A key observation is that the SAM edge of stability is often much smaller than 2/η, especially early in training. This suggests that SAM drives the solutions toward smoother regions of parameter space while the loss is still large, rather than first minimizing the training error and then drifting to wider minima.
The authors also examine the alignment between the gradients used by SAM and the principal eigenvector of the Hessian. They find that SAM's gradients tend to be more closely aligned with this eigenvector, which may help explain SAM's success in reducing the operator norm of the Hessian.
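The alignment itself is straightforward to measure. The following sketch (again illustrative, reusing the hypothetical hessian_top_eigenpair helper, model, data, params, and rho from the snippet above) compares the cosine of the angle that the plain gradient and SAM's perturbed gradient each make with the top Hessian eigenvector:

```python
# Minimal sketch: cosine alignment of the plain gradient and the SAM gradient
# with the top Hessian eigenvector. Assumes the previous snippet has run.
def cosine(a, b):
    return (a @ b / (a.norm() * b.norm() + 1e-12)).abs().item()

# Plain gradient and top Hessian eigenvector at the current iterate.
grad = torch.cat([g.reshape(-1) for g in torch.autograd.grad(
    nn.functional.mse_loss(model(x), y), params)])
_, top_v = hessian_top_eigenpair(nn.functional.mse_loss(model(x), y), params)

# SAM's gradient: taken at the iterate perturbed by rho * grad / ||grad||.
offsets = torch.split(rho * grad / grad.norm(), [p.numel() for p in params])
with torch.no_grad():
    for p, off in zip(params, offsets):
        p.add_(off.reshape(p.shape))
sam_grad = torch.cat([g.reshape(-1) for g in torch.autograd.grad(
    nn.functional.mse_loss(model(x), y), params)])
with torch.no_grad():   # undo the perturbation
    for p, off in zip(params, offsets):
        p.sub_(off.reshape(p.shape))

print(f"|cos(grad, v1)|     = {cosine(grad, top_v):.3f}")
print(f"|cos(SAM grad, v1)| = {cosine(sam_grad, top_v):.3f}")
```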
Overall, the paper provides theoretical and empirical insights into how SAM interacts with the edge-of-stability phenomenon to drive neural network training toward solutions with smoother Hessians, which can improve generalization.
Key insights from the paper by Philip M. Lo... on arxiv.org, 04-10-2024: https://arxiv.org/pdf/2309.12488.pdf