Dichotomy of Early and Late Phase Implicit Biases Provably Induces Grokking in Neural Network Training
The dichotomy of early and late phase implicit biases induced by large initialization and small weight decay can provably lead to a sharp transition from memorization to generalization, a phenomenon known as "grokking", in the training of homogeneous neural networks.
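
To make the terms in this statement concrete, here is a minimal sketch and not the paper's construction. It uses a bias-free two-layer ReLU network as one standard example of a homogeneous model: scaling all parameters by c > 0 scales the output by c^2. The helper names (`forward`, `scale`, `init_params`, `decay_step`), the Gaussian base initialization, and the values of `alpha`, `lr`, and `wd` are illustrative assumptions, not taken from the source.

```python
import random


def relu(z):
    return z if z > 0.0 else 0.0


def forward(params, x):
    """Bias-free two-layer ReLU net: f(theta; x) = sum_j a_j * relu(w_j . x)."""
    W, a = params
    return sum(
        a_j * relu(sum(w * x_i for w, x_i in zip(w_j, x)))
        for w_j, a_j in zip(W, a)
    )


def scale(params, c):
    """Return a copy of the parameters with every entry multiplied by c."""
    W, a = params
    return [[c * w for w in w_j] for w_j in W], [c * a_j for a_j in a]


def sq_norm(params):
    """Squared Euclidean norm of all parameters."""
    W, a = params
    return sum(w * w for w_j in W for w in w_j) + sum(a_j * a_j for a_j in a)


def init_params(width, dim, alpha, seed=0):
    """Hypothetical 'large initialization': a standard Gaussian draw scaled by alpha."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(width)]
    a = [rng.gauss(0.0, 1.0) for _ in range(width)]
    return scale((W, a), alpha)


def decay_step(params, lr, wd):
    """One gradient step on the weight-decay term (wd / 2) * ||theta||^2 alone:
    theta <- theta - lr * wd * theta = (1 - lr * wd) * theta."""
    return scale(params, 1.0 - lr * wd)


if __name__ == "__main__":
    x = [0.5, -1.0, 2.0]
    theta = init_params(width=4, dim=3, alpha=10.0)  # alpha chosen arbitrarily
    # Homogeneity check: the two printed values agree up to rounding.
    print(forward(scale(theta, 3.0), x), 9.0 * forward(theta, x))
    # With small wd, a single decay step barely changes the parameter norm.
    print(sq_norm(theta), sq_norm(decay_step(theta, lr=0.1, wd=0.01)))
```

In this sketch, `alpha` sets how large the initial parameters are, while `wd` sets how slowly the decay term shrinks them: each `decay_step` multiplies the parameter norm by `1 - lr * wd`, which is close to 1 when `wd` is small.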