Core Concepts
XGrad introduces weight prediction into popular gradient-based optimizers like SGD with momentum, Adam, AdamW, AdaBelief, and AdaM3 to boost their convergence and generalization when training deep neural network models.
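Below is a minimal, hedged sketch of how such a weight-prediction training step could look for SGD with momentum in PyTorch. The step count s, the helper name xgrad_sgdm_step, and the simple extrapolation rule (stepping s times along the cached momentum direction) are illustrative assumptions, not the paper's exact construction.

```python
import torch

def xgrad_sgdm_step(model, loss_fn, batch, optimizer, s=1):
    """One training step that computes gradients at predicted future weights."""
    lr = optimizer.param_groups[0]["lr"]

    # 1) Predict the weights s updates ahead from the cached momentum buffers.
    backups = []
    for p in model.parameters():
        backups.append(p.detach().clone())
        v = optimizer.state.get(p, {}).get("momentum_buffer")
        if v is not None:
            # Assumed prediction rule: extrapolate along the momentum direction.
            p.data.add_(v, alpha=-s * lr)

    # 2) Forward and backward passes at the predicted weights.
    inputs, targets = batch
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # 3) Restore the original weights, then apply the usual optimizer update
    #    using the gradients obtained at the predicted weights.
    for p, w in zip(model.parameters(), backups):
        p.data.copy_(w)
    optimizer.step()
    return loss.item()
```

In a training loop this would replace the usual zero_grad/forward/backward/step sequence; the same idea would extend to Adam-style optimizers by extrapolating with their moment estimates instead of the momentum buffer.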
Abstract
The paper proposes the XGrad framework, which incorporates weight prediction into the training of deep neural networks using gradient-based optimizers.
Key highlights:
XGrad derives the mathematical relationship between the initial weights and the future weights after s consecutive updates for several popular optimizers, including SGD with momentum, RMSprop, Adam, AdamW, AdaBelief, and AdaM3 (an illustrative sketch follows these highlights).
XGrad uses the predicted future weights for both the forward pass and backward propagation, so the optimizer updates the model parameters with gradients computed with respect to the future weights.
Extensive experiments on 19 deep learning models spanning image classification, natural language processing, and image generation tasks demonstrate that XGrad consistently improves model accuracy compared to the baseline optimizers.
For example, XGrad achieves an average 0.98% top-1 accuracy improvement over SGD with momentum on CIFAR-10, and a 0.76% accuracy improvement over Adam along with a 0.74-point higher BLEU score on the WMT-16 EN→De dataset.
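As a rough illustration of the kind of relationship the first highlight refers to (a simplified sketch, not the paper's exact derivation), take SGD with momentum and assume future gradients are negligible relative to the momentum term:

```latex
% Simplified illustration only, not the paper's exact derivation.
% SGD-with-momentum updates: v_{t+1} = \mu v_t + g_{t+1}, \quad w_{t+1} = w_t - \gamma v_{t+1}.
% Treating future gradients as negligible gives v_{t+k} \approx \mu^{k} v_t, so
\[
  \hat{w}_{t+s} \;\approx\; w_t - \gamma \sum_{k=1}^{s} \mu^{k} v_t
                \;=\; w_t - \gamma\,\mu\,\frac{1-\mu^{s}}{1-\mu}\, v_t .
\]
```

Under this simplification, the predicted weights are reached by stepping from the current weights along the momentum direction, which is what the forward and backward passes are then evaluated at.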
Stats
XGrad achieves an average top-1 accuracy improvement of 0.98% over SGD with momentum when training on the CIFAR-10 dataset.
Compared to Adam, XGrad achieves an average accuracy improvement of 0.76% and a 0.74-point higher BLEU score when training GNMT-8 on the WMT-16 EN→De dataset.
Quotes
"XGrad is rather straightforward to implement yet pretty effective in boosting the convergence of gradient-based optimizers and the accuracy of DNN models."
"The experiment results demonstrate that XGrad can improve the model accuracy compared with the original optimizer."