The paper proposes a novel method called "Grad Queue" (GQ) to address the challenge of preserving informative gradients in large batch updates. The key insights are:
Gradients can be classified into three categories: monotonous, sparse, and noisy. Monotonous gradients corresponding to abundant samples can suppress the sparse gradients associated with rare but important features.
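A toy numerical illustration (not from the paper) of how plain batch averaging dilutes a rare gradient direction: with one informative sample out of 256, its component survives in the batch mean at only roughly 1/256 of its original magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size = 256

# Hypothetical per-sample gradients: most samples push along a common
# "monotonous" direction; one rare sample carries a distinct feature.
common = np.tile(np.array([1.0, 0.0]), (batch_size - 1, 1))
common += 0.05 * rng.standard_normal(common.shape)  # small noise
rare = np.array([[0.0, 1.0]])                        # sparse but informative

grads = np.vstack([common, rare])
batch_mean = grads.mean(axis=0)

print(batch_mean)  # second component is ~1/256: the rare signal is diluted
```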
GQ maintains a finite queue of recent gradients and computes their running statistics, such as the expected gradient. It then applies a distance-based amplification function that boosts sparse gradients and dampens monotonous ones.
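A minimal sketch of the queue-and-reweight idea, assuming a simple distance-based amplification of my own choosing; `GradQueue` and `reweight` are illustrative names, and the paper's exact statistics and weighting function may differ.

```python
from collections import deque
import numpy as np

class GradQueue:
    """Keep the last `maxlen` gradients, and rescale each new gradient by
    how far it lies from their running mean (illustrative sketch only)."""

    def __init__(self, maxlen=32, eps=1e-8):
        self.queue = deque(maxlen=maxlen)
        self.eps = eps

    def reweight(self, grad):
        if not self.queue:
            self.queue.append(grad.copy())
            return grad
        mean = np.mean(np.stack(list(self.queue)), axis=0)
        # Assumed amplification: gradients far from the recent mean
        # (sparse/novel) are boosted, gradients close to it (monotonous)
        # are dampened.
        dist = np.linalg.norm(grad - mean)
        scale = dist / (np.linalg.norm(mean) + self.eps)
        self.queue.append(grad.copy())
        return scale * grad
```

In a training loop one would pass `gq.reweight(grad)` instead of the raw gradient to the SGD or momentum step.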
For batch sizes beyond the optimal range, the method clusters the samples based on their feature representations and applies the GQ amplification to the cluster centers. This ensures that the sparse gradients within each cluster are preserved and weighted accordingly.
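One plausible shape for that clustering step, assuming k-means over per-sample feature vectors and reusing the `GradQueue` sketch above; the cluster count, the choice of k-means, and the size-weighted recombination are my own illustrative assumptions, not necessarily the paper's.

```python
import numpy as np
from sklearn.cluster import KMeans

def clustered_reweight(per_sample_grads, features, grad_queue, n_clusters=4):
    """Group samples by feature similarity, average gradients within each
    cluster, reweight each cluster-mean gradient through the grad queue,
    and recombine them weighted by cluster size (illustrative sketch).

    per_sample_grads: array of shape (batch, dim)
    features:         array of shape (batch, feat_dim)
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
    combined = np.zeros_like(per_sample_grads[0])
    for c in range(n_clusters):
        mask = labels == c
        if not mask.any():
            continue
        cluster_grad = per_sample_grads[mask].mean(axis=0)
        combined += (mask.sum() / len(labels)) * grad_queue.reweight(cluster_grad)
    return combined
```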
The length of the gradient queue is made variable, tracking the trend in loss convergence to focus on the most relevant past gradients.
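A sketch of one way the queue length could track the loss trend; the thresholds and the direction of adjustment (shrink while the loss is still falling quickly, grow as it plateaus) are my own guesses, not the paper's rule.

```python
def adjust_queue_length(recent_losses, cur_len, min_len=8, max_len=128):
    """Shrink the queue while the loss is improving quickly (old gradients
    go stale) and grow it as the loss plateaus (more history stays relevant).
    Illustrative heuristic only."""
    if len(recent_losses) < 2:
        return cur_len
    trend = recent_losses[-1] - recent_losses[0]  # negative => still improving
    if trend < -1e-3:
        return max(min_len, cur_len // 2)
    return min(max_len, cur_len * 2)
```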
Theoretical analysis is provided to show how GQ can lower the threshold for sparse gradients to align with the momentum update, improving their influence on the optimization.
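As a rough illustration of why reweighting can help (my own simplification, not the paper's derivation): heavy-ball momentum is linear in past gradients, so scaling a sparse gradient scales its contribution to every subsequent update by the same factor.

```latex
% Illustrative only; the paper's actual analysis may differ.
m_t \;=\; \sum_{k=0}^{t} \beta^{\,t-k} g_k
\qquad\Longrightarrow\qquad
g_s \mapsto w(g_s)\, g_s ,\; w(g_s) > 1,
\;\text{ scales the } g_s \text{ term in } m_t \text{ by } w(g_s).
```

Any magnitude threshold the sparse gradient must exceed to register in the momentum direction is therefore relaxed by the same factor.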
Experiments on the CIFAR-10, MNIST, and Reuters news datasets show that GQ-boosted optimizers outperform vanilla SGD and Adam, especially at large batch sizes.
Source: Irfan Mohamm..., arxiv.org, 04-29-2024, https://arxiv.org/pdf/2404.16917.pdf