The paper studies the grokking phenomenon, where a neural network first "memorizes" the training set with perfect training accuracy but near-random test accuracy, and then suddenly transitions to perfect test accuracy after further training. The authors show that this phenomenon can be provably induced by a dichotomy of early and late phase implicit biases in the training process.
Specifically, the authors consider training homogeneous neural networks (e.g., MLPs and CNNs with ReLU activations) with large initialization and small weight decay. They prove that in the early phase, training gets trapped at a solution corresponding to a kernel predictor, which fits the training set but generalizes poorly. After a sharp transition around time (1/λ)·log α, where λ is the weight decay and α is the initialization scale, the dynamics escape this kernel regime and converge to a min-norm/max-margin predictor, yielding a dramatic improvement in test accuracy.
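To make the setup concrete, here is a minimal PyTorch sketch of the kind of training run the paper analyzes: a homogeneous ReLU network whose standard initialization is scaled up by a factor α, trained with a small weight decay λ. The architecture sizes, synthetic data, learning rate, and step count are illustrative assumptions, not the paper's experiments.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

alpha = 10.0   # initialization scale (large relative to standard init)
lam = 1e-3     # weight decay (small)

# Homogeneous model: a two-layer ReLU MLP with no bias terms.
model = nn.Sequential(nn.Linear(20, 256, bias=False), nn.ReLU(),
                      nn.Linear(256, 1, bias=False))
with torch.no_grad():
    for p in model.parameters():
        p.mul_(alpha)  # scale the standard initialization up by alpha

# Synthetic +/-1 classification data (purely illustrative).
X = torch.randn(128, 20)
y = torch.sign(X[:, 0]).unsqueeze(1)

opt = torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=lam)

# The paper's analysis predicts the escape from the kernel regime around
# gradient-flow time t ~ (1/lam) * log(alpha); how many SGD steps that
# corresponds to depends on the learning rate.
print("predicted transition time ~", math.log(alpha) / lam)

for step in range(2000):
    opt.zero_grad()
    loss = F.softplus(-y * model(X)).mean()  # logistic loss on +/-1 labels
    loss.backward()
    opt.step()
```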
The authors provide concrete examples of this phenomenon in linear classification with diagonal linear nets and low-rank matrix completion with overparameterized models. They also show that the opposite can happen, where the early phase bias leads to good generalization but the late phase bias causes a sudden drop in test accuracy, a phenomenon they call "misgrokking".
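For the linear classification example, a diagonal linear net can be sketched as follows. The quadratic parameterization w = u⊙u − v⊙v, the sparse ground truth, and all hyperparameters below are illustrative assumptions rather than the paper's exact construction.

```python
import torch
import torch.nn.functional as F

d, n = 50, 32
X = torch.randn(n, d)
y = torch.sign(X[:, 0])          # labels determined by one sparse direction

alpha, lam = 10.0, 1e-3
# Diagonal linear net: effective linear weights w = u*u - v*v (elementwise),
# initialized at scale alpha.
u = torch.full((d,), alpha, requires_grad=True)
v = torch.full((d,), alpha, requires_grad=True)

opt = torch.optim.SGD([u, v], lr=1e-3, weight_decay=lam)
for step in range(5000):
    w = u * u - v * v                       # effective linear predictor
    loss = F.softplus(-y * (X @ w)).mean()  # logistic loss on +/-1 labels
    opt.zero_grad()
    loss.backward()
    opt.step()

# In the paper's analysis, the effective predictor stays close to a kernel
# (near-linear-in-parameters) solution early on, then moves toward a
# max-margin (here, sparse) solution after roughly (log alpha)/lam units of
# gradient-flow time.
print("learned weights:", (u * u - v * v).detach())
```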
The key insight is that the large initialization induces a strong early phase bias towards kernel predictors, which do not generalize well in these settings. This bias decays over time and eventually gives way to a late phase bias towards min-norm/max-margin predictors induced by the small weight decay, which produces the sharp transition.
Key insights from: Kaifeng Lyu et al., arxiv.org, 04-03-2024. https://arxiv.org/pdf/2311.18817.pdf