Core Concepts
The authors propose a new dual-joint search space for neural optimizer search (NOS) in which the weight update equation, internal decay functions, and learning rate schedule are searched jointly. They discover multiple optimizers, learning rate schedules, and Adam variants that outperform standard deep learning optimizers across image classification tasks.
Abstract
The authors present a new approach for neural optimizer search (NOS) that expands on previous work by simultaneously optimizing the weight update equation, internal decay functions, and learning rate schedules.
Key highlights:
Proposed a new dual-joint search space for NOS that incorporates recent advances in deep learning optimizers, including quasi-hyperbolic momentum, AdaBelief, and AMSGrad, among others.
Developed an integrity check to efficiently eliminate degenerate optimizers, together with a problem-specific, mutation-only genetic algorithm that can be massively parallelized (an illustrative sketch follows this list).
Discovered multiple optimizers, learning rate schedules, and Adam variants that outperformed standard deep learning optimizers like Adam, SGD, and RMSProp across image classification tasks on CIFAR-10, CIFAR-100, TinyImageNet, Flowers102, Cars196, and Caltech101.
The discovered optimizers leverage concepts like quasi-hyperbolic momentum, adaptive learning rates, and custom decay functions to achieve superior performance (the quasi-hyperbolic momentum update is sketched below).
Conducted supplementary experiments to obtain Adam variants and new learning rate schedules for Adam, further expanding the set of powerful optimizers.
Demonstrated the importance of the jointly learned decay functions and learning rate schedules in the discovered optimizers, as removing them degraded performance.
The authors' comprehensive approach to NOS, incorporating the latest advancements in deep learning optimizers, has led to the discovery of highly effective optimizers that can serve as drop-in replacements for standard optimizers across a variety of image recognition tasks.
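To make the search procedure concrete, the following is a minimal, illustrative sketch of a mutation-only genetic loop gated by an integrity check. The candidate encoding, mutation operator, rejection rules, and quadratic proxy fitness are simplified assumptions for illustration only, not the authors' implementation, which presumably evaluates candidates by training small image-classification models.

```python
# Illustrative sketch only: a mutation-only genetic loop gated by an
# integrity check, in the spirit of the search described above. The
# candidate encoding, mutation operator, rejection rules, and quadratic
# proxy fitness below are simplified assumptions, not the authors' code.
import math
import random

def random_candidate():
    # A candidate "optimizer" here is just three scalars controlling
    # update = -lr_t * (a * grad + b * momentum), with lr_t decayed over time.
    return {"a": random.uniform(0.0, 2.0),
            "b": random.uniform(0.0, 2.0),
            "decay": random.uniform(0.0, 1.0)}

def mutate(cand):
    # Mutation-only: perturb one randomly chosen coefficient.
    child = dict(cand)
    key = random.choice(list(child))
    child[key] += random.gauss(0.0, 0.1)
    return child

def integrity_check(cand):
    # Cheap structural check that rejects degenerate candidates before any
    # expensive evaluation, e.g. updates that effectively ignore the gradient.
    if abs(cand["a"]) < 1e-3 and abs(cand["b"]) < 1e-3:
        return False
    return all(math.isfinite(v) and abs(v) < 10.0 for v in cand.values())

def fitness(cand, steps=50):
    # Toy proxy task: minimize a 1-D quadratic. Lower final loss is better.
    x, m = 5.0, 0.0
    for t in range(steps):
        grad = 2.0 * x
        m = 0.9 * m + grad
        lr = 0.05 * (1.0 - cand["decay"] * t / steps)
        x -= lr * (cand["a"] * grad + cand["b"] * m)
        if not math.isfinite(x):
            return float("inf")
    return x * x

population = [random_candidate() for _ in range(16)]
for generation in range(20):
    parent = min(population, key=fitness)
    children = [mutate(parent) for _ in range(16)]
    # Each child can be evaluated on a separate worker, which is what
    # makes a mutation-only strategy easy to massively parallelize.
    population = [c for c in children if integrity_check(c)] or [parent]

print("best surviving candidate:", min(population, key=fitness))
```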
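As a reference for one of the reused concepts, here is a minimal sketch of the quasi-hyperbolic momentum (QHM) update rule. The hyperparameter values and toy objective are illustrative assumptions, not the paper's notation or settings.

```python
# Minimal sketch of the quasi-hyperbolic momentum (QHM) update, one of the
# building blocks incorporated into the search space. Hyperparameter values
# below are illustrative, not the paper's settings.
import numpy as np

def qhm_step(theta, grad, buf, lr=0.1, beta=0.9, nu=0.7):
    """One QHM step: the update interpolates between the raw gradient and an
    exponential moving average of gradients (nu=0 is SGD, nu=1 is momentum)."""
    buf = beta * buf + (1.0 - beta) * grad          # EMA of gradients
    theta = theta - lr * ((1.0 - nu) * grad + nu * buf)
    return theta, buf

# Toy usage: minimize f(theta) = ||theta||^2 / 2, whose gradient is theta.
theta, buf = np.ones(3) * 5.0, np.zeros(3)
for _ in range(200):
    theta, buf = qhm_step(theta, theta.copy(), buf)
print(theta)  # approaches the zero vector
```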
Stats
The authors report the following key metrics:
CIFAR-10 test accuracy up to 96.23%
CIFAR-100 test accuracy up to 79.80%
Flowers102 test accuracy up to 97.76%
Cars196 test accuracy up to 91.79%
Caltech101 test accuracy up to 92.76%
TinyImageNet test accuracy up to 48.82%
Quotes
"The success of Opt6 is heavily dependent upon the inherently learned LR2 schedule, as Opt61 always under-performed Opt6 when training from scratch."
"Opt101 always out-performed Nesterov's momentum when training from scratch. We empirically noticed that Opt101 liked large learning rates around 10. We believe that the double scaling of the gradients clips larger gradient values, allowing for larger learning rates to scale gradients near zero to have more of an effect, which empirically seems beneficial."