This paper presents a large-scale empirical study of the reliability and limitations of the µ-Transfer technique for transferring hyperparameters, particularly learning rates, from small to large transformer models.
The key highlights and insights are:
The authors establish baseline results showing that µ-Transfer reliably transfers learning rates across transformer models ranging from 2M to 10B parameters when standard architectural choices are used (a sketch of the underlying learning-rate rule follows this list).
The authors investigate the impact of various architectural modifications, such as using projection biases, RMSNorm gains, different attention scales, and multiplicative nonlinearities. They find that µ-Transfer is compatible with most of these changes, but can break down when using trainable scale parameters in the network.
The authors conduct the largest-scale µ-Transfer experiment to date, demonstrating that the optimal learning rate found for a 2M parameter model accurately predicts the optimum for a 10B parameter model.
The authors also explore the compatibility of µ-Transfer with other techniques, such as decoupled weight decay, large and small batch sizes, and the Lion optimizer. They find that µ-Transfer generally works well, with some exceptions; the weight-decay distinction is sketched after the concluding paragraph below.
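For readers unfamiliar with the mechanics, the sketch below illustrates the learning-rate side of µ-Transfer under common assumptions: the learning rate is tuned on a narrow proxy model and reused at the target width, with the learning rate of hidden weight matrices scaled down in proportion to width and the attention logits scaled by 1/d_head rather than 1/sqrt(d_head). This is a minimal illustration, not the paper's code; names such as `make_param_groups`, `d_base`, and `base_lr` are hypothetical, and the sketch omits details like initialization scales and the treatment of the output/unembedding layer.

```python
import torch
import torch.nn as nn

def mup_attention_scores(q, k, d_head):
    # µP-style attention: scale logits by 1/d_head instead of 1/sqrt(d_head).
    return (q @ k.transpose(-2, -1)) / d_head

def make_param_groups(model, base_lr, d_base, d_model):
    """Assign per-parameter learning rates following a µ-Transfer-style rule (sketch)."""
    hidden, vector_like = [], []
    for name, p in model.named_parameters():
        # Heuristic: 2-D weights (other than embeddings) are "hidden" matrices
        # whose learning rate shrinks in proportion to width; vectors such as
        # biases and norm gains keep the base learning rate unchanged.
        if p.ndim == 2 and "embed" not in name:
            hidden.append(p)
        else:
            vector_like.append(p)
    return [
        {"params": hidden, "lr": base_lr * d_base / d_model},
        {"params": vector_like, "lr": base_lr},
    ]

# Usage: tune base_lr on a proxy of width d_base, then reuse it unchanged at
# the target width d_model; only the per-group scaling above differs.
d_base, d_model, base_lr = 128, 1024, 3e-3  # illustrative values
model = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                      nn.Linear(4 * d_model, d_model))
optimizer = torch.optim.AdamW(make_param_groups(model, base_lr, d_base, d_model))
```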
Overall, the results provide a comprehensive empirical picture of the strengths and limitations of µ-Transfer for scaling transformer models, and offer practical guidance for applying the technique.
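The interaction with weight decay mentioned above hinges on whether the decay term is tied to the learning rate. Below is a hedged sketch of the two common conventions, assuming simple elementwise updates; the function names are hypothetical and the exact variant studied in the paper may differ.

```python
def adamw_style_step(w, adam_update, lr, wd):
    # AdamW as commonly implemented: the decay term is multiplied by lr, so a
    # width-scaled learning rate also rescales the effective decay strength.
    return w - lr * adam_update - lr * wd * w

def fully_decoupled_step(w, adam_update, lr, wd):
    # Decay applied independently of lr, so it is unaffected by the
    # per-width learning-rate scaling used in µ-Transfer.
    return w - lr * adam_update - wd * w
```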