A novel first-order method based on the Gauss-Newton approach is proposed to efficiently solve the min-max optimization problem in training generative adversarial networks (GANs). The method uses a fixed-point iteration with a Gauss-Newton preconditioner and achieves state-of-the-art performance on image generation tasks while maintaining computational efficiency.
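To make the update concrete, here is a minimal NumPy sketch of a Gauss-Newton-preconditioned fixed-point iteration on a toy bilinear min-max game. The damped preconditioner (J^T J + lam*I)^{-1} J^T, the toy objective, and all hyperparameters are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

# Toy bilinear min-max game: f(x, y) = x^T A y  (min over x, max over y).
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
x, y = rng.standard_normal(3), rng.standard_normal(3)

def grad_field(x, y):
    # Simultaneous-gradient vector field v = (grad_x f, -grad_y f).
    return np.concatenate([A @ y, -(A.T @ x)])

def jacobian(x, y):
    # Jacobian of the vector field for this bilinear game.
    top = np.hstack([np.zeros((3, 3)), A])
    bot = np.hstack([-A.T, np.zeros((3, 3))])
    return np.vstack([top, bot])

lam, lr = 1e-1, 1.0
for _ in range(200):
    v = grad_field(x, y)
    J = jacobian(x, y)
    # Damped Gauss-Newton preconditioner applied to the gradient field,
    # followed by a fixed-point step z <- z - lr * update.
    update = np.linalg.solve(J.T @ J + lam * np.eye(6), J.T @ v)
    z = np.concatenate([x, y]) - lr * update
    x, y = z[:3], z[3:]

print("distance to the (0, 0) saddle:", np.linalg.norm(np.concatenate([x, y])))
```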
The authors propose a new dual-joint search space for neural optimizer search (NOS) that simultaneously optimizes the weight update equation, internal decay functions, and learning rate schedules. They discover multiple optimizers, learning rate schedules, and Adam variants that outperform standard deep learning optimizers across image classification tasks.
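As an illustration of what a jointly searched candidate might look like, the sketch below encodes one candidate as a weight-update expression plus an internal decay function plus a learning-rate schedule and evaluates it on a toy quadratic. The operand set, encoding, and fitness proxy are assumptions for illustration, not the paper's actual search-space grammar or evaluation pipeline.

```python
import numpy as np

def make_candidate(update_op, decay_fn, schedule_fn):
    # One "dual-joint" candidate: update expression + decay + schedule.
    def step(w, g, state, t, base_lr):
        m = state.get("m", np.zeros_like(w))
        m = 0.9 * m + g                          # momentum accumulator operand
        direction = update_op(g, m)              # searched weight-update expression
        lr = base_lr * schedule_fn(t) * decay_fn(t)
        state["m"] = m
        return w - lr * direction, state
    return step

# One sampled candidate from the joint space (illustrative choices).
candidate = make_candidate(
    update_op=lambda g, m: np.sign(m) * np.abs(g),                        # update expression
    decay_fn=lambda t: 1.0 / (1.0 + 1e-3 * t),                            # internal decay
    schedule_fn=lambda t: 0.5 * (1 + np.cos(np.pi * min(t / 500, 1.0))),  # cosine schedule
)

# Evaluate the candidate on a toy quadratic loss as a proxy fitness.
w, state = np.ones(10) * 5.0, {}
for t in range(500):
    g = 2.0 * w                                  # gradient of ||w||^2
    w, state = candidate(w, g, state, t, base_lr=0.05)
print("final loss:", float(w @ w))
```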
This paper proposes a novel optimizer, Variational Stochastic Gradient Descent (VSGD), that combines gradient descent with probabilistic modeling of the true gradients as latent random variables. This approach enables more principled treatment of gradient noise and uncertainty, leading to improved optimization performance compared to existing methods such as Adam and SGD.
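The sketch below illustrates the general idea of treating the true gradient as a latent Gaussian variable and descending along its posterior mean. The fixed noise precision, forgetting factor, and update order are illustrative assumptions rather than the paper's actual variational updates.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.ones(5) * 3.0
mu = np.zeros_like(w)        # posterior mean of the latent true gradient
prec = np.ones_like(w)       # posterior precision of the latent true gradient
obs_prec = 4.0               # assumed precision of the gradient noise
forget = 0.9                 # discounts old evidence as the weights move
lr = 0.1

for _ in range(300):
    g = 2.0 * w + 0.5 * rng.standard_normal(5)       # noisy gradient of ||w||^2
    prior_prec = forget * prec                       # inflate uncertainty slightly
    mu = (prior_prec * mu + obs_prec * g) / (prior_prec + obs_prec)
    prec = prior_prec + obs_prec                     # Gaussian posterior update
    w = w - lr * mu                                  # descend along the posterior mean

print("final loss:", float(w @ w))
```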
Sharpness-Aware Minimization (SAM) is a gradient-based neural network training algorithm that explicitly seeks to find solutions that avoid "sharp" minima. The authors derive an "edge of stability" for SAM, which depends on the norm of the gradient, and show empirically that SAM operates at this edge of stability across multiple deep learning tasks.
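For reference, the standard SAM update perturbs the weights by rho along the normalized gradient and then descends using the gradient taken at the perturbed point. Below is a minimal NumPy sketch on a toy ill-conditioned quadratic with illustrative hyperparameters.

```python
import numpy as np

def grad(w, H):
    return H @ w                       # gradient of 0.5 * w^T H w

rng = np.random.default_rng(0)
H = np.diag([10.0, 1.0, 0.1])          # ill-conditioned quadratic
w = rng.standard_normal(3)
rho, lr = 0.05, 0.05

for _ in range(500):
    g = grad(w, H)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent step toward the sharp direction
    g_sam = grad(w + eps, H)                     # gradient at the perturbed weights
    w = w - lr * g_sam                           # SAM descent step

print("final loss:", float(0.5 * w @ H @ w))
```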
XGrad introduces weight prediction into popular gradient-based optimizers like SGD with momentum, Adam, AdamW, AdaBelief, and AdaM3 to boost their convergence and generalization when training deep neural network models.
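A minimal sketch of the weight-prediction idea with SGD-momentum is shown below: predict where the momentum buffer will carry the weights, evaluate the gradient at that predicted point, then apply the usual update to the current weights. The one-step Nesterov-style lookahead and the toy objective are simplifying assumptions, not XGrad's exact prediction scheme.

```python
import numpy as np

def grad(w):
    return 2.0 * w                     # gradient of ||w||^2

w = np.ones(4) * 5.0
v = np.zeros_like(w)
lr, momentum = 0.05, 0.9

for _ in range(300):
    w_pred = w - lr * momentum * v     # predicted future weights (lookahead)
    g = grad(w_pred)                   # gradient evaluated at the prediction
    v = momentum * v + g               # momentum buffer update
    w = w - lr * v                     # apply the step to the current weights

print("final loss:", float(w @ w))
```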
This paper establishes quantitative convergence results for the value functions and optimal parameters of neural SDEs as the sample size grows to infinity. The authors analyze the Hamilton-Jacobi-Bellman equation corresponding to the N-particle system and obtain uniform regularity estimates, which are then used to show the convergence of the minima of objective functionals and optimal parameters.
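For context, a schematic finite-horizon HJB equation of the kind referenced, written with a generic drift b, diffusion sigma, running cost \ell, terminal cost g, and control/parameter theta, is shown below; the notation is a standard illustration and not taken from the paper.

```latex
\partial_t V(t,x)
  + \inf_{\theta \in \Theta} \Big\{ b(x,\theta) \cdot \nabla_x V(t,x)
  + \tfrac{1}{2}\,\mathrm{Tr}\!\big(\sigma(x,\theta)\sigma(x,\theta)^{\top} \nabla_x^2 V(t,x)\big)
  + \ell(x,\theta) \Big\} = 0,
\qquad V(T,x) = g(x).
```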
Injecting randomness at various stages of the deep learning training process, including data, model, optimization, and learning, can significantly improve performance across computer vision benchmarks.
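The toy sketch below shows three of the injection points named above, data (input noise), model (a dropout-style mask), and optimization (gradient noise), on an assumed linear regression problem; the noise scales and the model are for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 20))
true_w = rng.standard_normal(20)
y = X @ true_w

w = np.zeros(20)
lr = 0.05
for _ in range(500):
    idx = rng.integers(0, 256, size=32)                 # mini-batch sampling
    xb, yb = X[idx], y[idx]
    xb = xb + 0.01 * rng.standard_normal(xb.shape)      # data: small input noise
    mask = (rng.random(20) > 0.1) / 0.9                 # model: inverted dropout mask
    xm = xb * mask
    g = 2.0 * xm.T @ (xm @ w - yb) / len(yb)            # gradient of the batch MSE
    g = g + 0.01 * rng.standard_normal(20)              # optimization: gradient noise
    w = w - lr * g

# Dropout acts as a regularizer, so the clean-data fit is close but not exact.
print("clean-data MSE:", float(np.mean((X @ w - y) ** 2)))
```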
This paper argues that neural collapse and plasticity loss in deep learning models are linked in a nontrivial way, and that this relationship can be leveraged to mitigate plasticity loss.
The authors propose a new optimization algorithm named CG-like-Adam that combines the advantages of conjugate gradient and adaptive moment estimation to speed up training and enhance the performance of deep neural networks.
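The sketch below shows one plausible way to combine a conjugate-gradient-style direction (here with a clipped Fletcher-Reeves coefficient) with Adam's second-moment scaling on a toy quadratic; the specific coefficient, clipping, and moment handling are illustrative and need not match CG-like-Adam's exact formulation.

```python
import numpy as np

def grad(w, H):
    return H @ w                       # gradient of 0.5 * w^T H w

H = np.diag([50.0, 5.0, 0.5])
w = np.array([1.0, 1.0, 1.0])
d = np.zeros(3)                        # CG-style direction (plays the role of Adam's m)
v = np.zeros(3)                        # Adam-style second moment
g_prev_sq = 1.0                        # placeholder before the first step
lr, beta2, eps = 0.05, 0.999, 1e-8

for t in range(1, 1001):
    g = grad(w, H)
    beta_fr = float(g @ g) / (g_prev_sq + eps)     # Fletcher-Reeves coefficient
    beta_fr = min(beta_fr, 0.9)                    # clip for stability (illustrative)
    d = -g + beta_fr * d                           # conjugate-gradient-like direction
    v = beta2 * v + (1 - beta2) * g * g            # Adam second moment
    v_hat = v / (1 - beta2 ** t)                   # bias correction
    w = w + lr * d / (np.sqrt(v_hat) + eps)        # adaptive step along the CG direction
    g_prev_sq = float(g @ g)

print("final loss:", float(0.5 * w @ H @ w))
```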
LUT-GEMM, an efficient kernel for quantized matrix multiplication, eliminates the resource-intensive dequantization process and reduces computational costs compared to previous kernels for weight-only quantization, enabling substantial acceleration of token generation latency in large-scale generative language models.
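The sketch below illustrates the lookup-table idea for 1-bit binary-coding-quantized weights: precompute, for each small group of input elements, the partial dot products with every sign pattern, then compute the matrix-vector product by table lookups without materializing dequantized weights. Group size, bit width, and packing are illustrative choices rather than the kernel's actual layout.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n, m = 4, 16, 8                        # group size, input dim, output dim
x = rng.standard_normal(n)
B = rng.choice([-1.0, 1.0], size=(m, n))   # 1-bit binary weight matrix
alpha = rng.random(m)                      # per-row scale factors

# One table per group of x: entry p holds the dot product of that group with the
# sign pattern encoded by the bits of p.
patterns = np.array([[1.0 if (p >> i) & 1 else -1.0 for i in range(mu)]
                     for p in range(2 ** mu)])          # shape (2^mu, mu)
tables = [patterns @ x[g:g + mu] for g in range(0, n, mu)]

def row_index(row, g):
    # Encode the signs of one weight group as a table index.
    bits = (row[g:g + mu] > 0).astype(int)
    return int(sum(b << i for i, b in enumerate(bits)))

# GEMV via table lookups: no dequantized weight matrix is ever materialized.
y_lut = np.array([alpha[r] * sum(tables[g // mu][row_index(B[r], g)]
                                 for g in range(0, n, mu))
                  for r in range(m)])

y_ref = (alpha[:, None] * B) @ x           # reference dense computation
print("max abs difference:", float(np.max(np.abs(y_lut - y_ref))))
```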