Core Concepts
The author explores how hyperparameters shape neural network training dynamics, revealing distinct regimes whose boundaries depend on the training-set size.
Abstract
The paper examines the different regimes of stochastic gradient descent (SGD) in deep learning. It discusses how gradient noise, batch size, and learning rate affect the training process and the generalization error. The study characterizes the critical batch size separating these regimes, the alignment dynamics between network outputs and labels, and how SGD performance varies across architectures and datasets.
Key points include:
SGD's key hyperparameters are batch size (B) and learning rate (η).
SGD exhibits distinct regimes: noise-dominated, first-step-dominated, and gradient descent (see the sketch after this list).
The critical batch size B* separating these regimes depends on the training-set size P.
The alignment between network output and labels is crucial for performance.
A small margin κ leads to lazy-regime behavior, while a large margin κ requires weight inflation.
Effects of momentum, adaptive learning rates, and weight decay on SGD performance are discussed.
The study emphasizes the importance of understanding how hyperparameters influence neural network training dynamics to improve performance.
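The sketch below (plain NumPy, not the authors' code or setup) makes the roles of B and η concrete on a toy linear-regression task. It treats T = η/B as the SGD noise scale, an assumption drawn from this literature rather than stated in the summary, and contrasts a small-batch run with full-batch gradient descent (B = P).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative only, not the paper's architectures or datasets):
# P samples in d dimensions, binary labels from a random teacher vector.
P, d = 1024, 32
X = rng.standard_normal((P, d))
y = np.sign(X @ rng.standard_normal(d))

def run_sgd(eta, B, steps=2000):
    """Minibatch SGD on the mean-squared error of a linear model w.

    eta is the learning rate (η), B the batch size. B = P is full-batch
    gradient descent; small B with large eta is the noise-dominated regime.
    """
    w = np.zeros(d)
    for _ in range(steps):
        idx = rng.choice(P, size=B, replace=False)    # draw a minibatch
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / B   # minibatch gradient
        w -= eta * grad
    return w

for eta, B in [(0.1, 8), (0.1, P)]:   # small-batch SGD vs. full-batch GD
    w = run_sgd(eta, B)
    acc = np.mean(np.sign(X @ w) == y)
    print(f"eta={eta}  B={B}  T=eta/B={eta / B:.4f}  train acc={acc:.2f}")
```

Shrinking B at fixed η (or raising η at fixed B) increases T and pushes the run toward the noise-dominated regime; in this toy setting the effect appears as a noisier trajectory rather than the generalization differences reported in the study.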
Stats
For small batches and large learning rates: ∥w⊥∥ ∼ T.
Critical batch size separating regimes: B* ∼ P^γ.
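As a worked illustration of the second stat, the snippet below fits the exponent γ by a log-log regression of B* against P. The B* values are synthetic placeholders generated from an assumed γ = 0.5, not measurements from the paper.

```python
import numpy as np

# Synthetic demo of extracting gamma from B* ~ P^gamma.
# The critical batch sizes below are placeholders, NOT data from the paper.
rng = np.random.default_rng(1)
P = np.array([1e3, 3e3, 1e4, 3e4, 1e5])        # training-set sizes
gamma_assumed = 0.5                             # exponent chosen for the demo
B_star = 2.0 * P**gamma_assumed * np.exp(0.05 * rng.standard_normal(P.size))

# Least-squares fit of log B* = gamma * log P + const.
gamma_fit, _ = np.polyfit(np.log(P), np.log(B_star), 1)
print(f"fitted gamma ≈ {gamma_fit:.2f}")        # recovers ~0.5 by construction
```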
Quotes
"The success of deep learning contrasts with its limited understanding."
"Our results explain the surprising observation that these hyperparameters strongly depend on the number of data available."