Core Concepts
AdamW implicitly performs constrained optimization under an ℓ∞-norm constraint on the parameters, converging to KKT points of the corresponding constrained problem.
Abstract
The content analyzes the implicit bias of the AdamW optimization algorithm. Key highlights:
AdamW, a variant of the popular Adam optimizer, achieves better optimization and generalization performance than Adam with ℓ2 regularization, yet the theoretical understanding of this advantage is limited.
The main result shows that in the full-batch setting, if AdamW converges, it must converge to a KKT point of the problem of minimizing the original loss under the constraint that the ℓ∞ norm of the parameters is at most the inverse of the weight decay factor.
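A minimal LaTeX sketch of this constrained problem and its KKT conditions, using notation chosen here (λ for the weight decay factor, θ for the parameters), not fixed by the summary above:

```latex
% Constrained problem implicitly solved by AdamW (lambda = weight decay factor)
\min_{\theta} \ \mathcal{L}(\theta)
\quad \text{s.t.} \quad \|\theta\|_{\infty} \le \tfrac{1}{\lambda}

% KKT conditions at \theta^*: there exists a multiplier \mu \ge 0 such that
\nabla \mathcal{L}(\theta^*) + \mu\, s = 0 \quad \text{for some } s \in \partial \|\theta^*\|_{\infty},
\qquad \mu \left( \|\theta^*\|_{\infty} - \tfrac{1}{\lambda} \right) = 0
```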
This result is built on the observation that Adam can be viewed as a smoothed version of SignGD, which is normalized steepest descent with respect to the ℓ∞ norm. There is also a surprising connection between normalized steepest descent with weight decay and the Frank-Wolfe algorithm, sketched below.
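A short derivation sketch of that connection, again in notation chosen here (step size η, weight decay λ), assuming the usual decoupled weight decay update:

```latex
% SignGD (normalized steepest descent w.r.t. l_inf) with decoupled weight decay:
\theta_{t+1} = \theta_t - \eta \left( \mathrm{sign}(\nabla \mathcal{L}(\theta_t)) + \lambda \theta_t \right)
             = (1 - \eta\lambda)\, \theta_t + \eta\lambda\, v_t,
\qquad v_t := -\tfrac{1}{\lambda}\, \mathrm{sign}(\nabla \mathcal{L}(\theta_t))

% v_t is exactly the Frank-Wolfe linear-minimization oracle over the l_inf ball of radius 1/lambda:
v_t \in \arg\min_{\|v\|_{\infty} \le 1/\lambda} \langle \nabla \mathcal{L}(\theta_t),\, v \rangle

% so each step is a Frank-Wolfe step on that ball with step size \gamma_t = \eta\lambda.
```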
The analysis also provides a convergence bound for normalized steepest descent with weight decay on convex loss functions, showing that it can converge much faster than normalized gradient descent with weight decay when the loss has favorable properties under the ℓ∞ geometry.
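A hedged numerical sketch (my own illustration, not the paper's experiment or bound) of why the ℓ∞ geometry can matter: on a convex quadratic whose gradient has many equally sized coordinates, sign-based steepest descent (ℓ∞-normalized) moves every coordinate by the full step size, while ℓ2-normalized gradient descent spreads one unit step across all d coordinates. All names and constants below are illustrative.

```python
import numpy as np

# Convex toy loss: L(theta) = 0.5 * ||theta - a||^2 with a dense target a.
# Weight decay lam is small enough that the ball ||theta||_inf <= 1/lam is inactive,
# so both methods could in principle reach the optimum a.
rng = np.random.default_rng(0)
d = 10_000
a = rng.uniform(0.5, 1.0, size=d)        # dense optimum: every coordinate matters
lam, eta, steps = 1e-3, 1e-2, 500

def grad(theta):
    return theta - a

def run(direction):
    """Run normalized steepest descent with decoupled weight decay."""
    theta = np.zeros(d)
    for _ in range(steps):
        g = grad(theta)
        theta = theta - eta * (direction(g) + lam * theta)
    return np.abs(theta - a).max()        # l_inf distance to the optimum

signgd_err = run(np.sign)                                       # l_inf-normalized (SignGD)
ngd_err = run(lambda g: g / (np.linalg.norm(g) + 1e-12))        # l_2-normalized GD

print(f"SignGD + weight decay, final error: {signgd_err:.4f}")
print(f"NGD    + weight decay, final error: {ngd_err:.4f}")
# Expect SignGD to be far closer after the same number of steps, since the
# l_2-normalized update moves each coordinate by only about eta / sqrt(d).
```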
Experiments on a language modeling task and a synthetic problem validate the theoretical predictions about the relationship between the ℓ∞ norm of the parameters and the AdamW hyperparameters.
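A small PyTorch sketch (again my own illustration, not the paper's setup) of the kind of check described: run full-batch AdamW on a synthetic regression problem whose unconstrained minimizer has large entries, then compare the resulting ℓ∞ norm of the parameters with 1/weight_decay. Hyperparameter values are illustrative.

```python
import torch

torch.manual_seed(0)

# Synthetic full-batch least-squares problem whose unconstrained solution
# likely has entries larger than 1/weight_decay, so the constraint should bind.
d, n = 50, 200
theta_star = 10.0 * torch.randn(d)          # "true" parameters with large entries
X = torch.randn(n, d)
y = X @ theta_star

theta = torch.zeros(d, requires_grad=True)
lr, weight_decay = 2e-3, 0.1                # predicted bound: ||theta||_inf <= 1/0.1 = 10
opt = torch.optim.AdamW([theta], lr=lr, weight_decay=weight_decay)

for step in range(30_000):
    opt.zero_grad()
    loss = 0.5 * ((X @ theta - y) ** 2).mean()   # full-batch loss
    loss.backward()
    opt.step()

linf = theta.detach().abs().max().item()
print(f"||theta||_inf = {linf:.3f}   vs   1/weight_decay = {1.0 / weight_decay:.3f}")
# The theory predicts the iterates eventually stay (up to learning-rate-sized error)
# inside the ball ||theta||_inf <= 1/weight_decay; here the constraint should be
# approximately active, so the two printed numbers should nearly coincide.
```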
Stats
The content does not contain key metrics or notable figures supporting the author's main arguments.
Quotes
The content does not contain striking quotes supporting the author's main arguments.