Scaling Transformer Models with µ-Transfer: A Comprehensive Empirical Study
An empirical investigation of the reliability and limitations of µ-Transfer, a technique for transferring hyperparameters, particularly learning rates, across transformer models of varying sizes.