Basic Concepts
This paper establishes theoretical guarantees for the global convergence of gradient flow when training large-scale Transformer models, by analyzing the models' mean-field limit and showing that it approximates a Wasserstein gradient flow.
Gao, C., Cao, Y., Li, Z., He, Y., Wang, M., Liu, H., Klusowski, J. M., & Fan, J. (2024). Global Convergence in Training Large-Scale Transformers. Advances in Neural Information Processing Systems, 38.