Empirical investigation of the reliability and limitations of the µ-Transfer technique for scaling hyperparameters, particularly learning rates, across transformer models of varying sizes.
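A minimal sketch of the width rule that µ-Transfer relies on, assuming the common Adam variant in which matrix-like weights have their learning rate scaled by base_width/width while embedding-like parameters keep the base rate; the function name, the `base_width` value, and the two-way layer grouping are illustrative assumptions, not the `mup` library's API.

```python
# Illustrative sketch (not the `mup` library's API) of the µ-Transfer width rule
# for Adam: learning rates for matrix-like weights shrink as base_width / width,
# while embedding-like parameters keep the base rate, so a learning rate tuned
# on a narrow proxy model can be reused at larger widths.

def mup_adam_lr(base_lr: float, width: int, base_width: int, kind: str) -> float:
    """Per-parameter-group Adam learning rate under a µP-style width rule.

    kind: "matrix" for hidden/readout weight matrices, "embedding" for
    vector-like parameters such as input embeddings and biases.
    """
    if kind == "embedding":
        return base_lr                       # unchanged as the model widens
    if kind == "matrix":
        return base_lr * base_width / width  # scaled down with the width multiplier
    raise ValueError(f"unknown parameter kind: {kind!r}")


if __name__ == "__main__":
    base_lr, base_width = 3e-3, 256          # tuned once on a small proxy model
    for width in (256, 1024, 4096):
        print(f"width={width:5d}  "
              f"matrix lr={mup_adam_lr(base_lr, width, base_width, 'matrix'):.2e}  "
              f"embedding lr={mup_adam_lr(base_lr, width, base_width, 'embedding'):.2e}")
```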
DISTFLASHATTN is a distributed attention mechanism that partitions long sequences into token chunks across multiple devices while preserving the IO-aware benefits of single-device memory-efficient attention. It introduces three key optimizations: load-balanced scheduling, overlap of communication and computation, and a rematerialization-aware gradient checkpointing strategy, to achieve high GPU utilization and low communication overhead for training long-context LLMs.
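A simplified, single-process sketch of the underlying idea of splitting attention over token chunks: each simulated "device" owns one query chunk and consumes the K/V blocks of every other chunk through a FlashAttention-style online softmax, so the full attention matrix is never materialized. It illustrates only the general blockwise scheme; it does not implement DISTFLASHATTN's load-balanced scheduling, communication/computation overlap, or rematerialization-aware checkpointing, and all names are illustrative.

```python
import numpy as np


def online_block_attention(q_chunk, kv_blocks):
    """Accumulate softmax(Q K^T / sqrt(d)) V over K/V blocks one block at a time."""
    d = q_chunk.shape[-1]
    m = np.full(q_chunk.shape[0], -np.inf)   # running row-wise max of the logits
    l = np.zeros(q_chunk.shape[0])           # running softmax denominator
    acc = np.zeros_like(q_chunk)             # running numerator (weighted sum of V)
    for k_blk, v_blk in kv_blocks:
        s = q_chunk @ k_blk.T / np.sqrt(d)   # logits against this K/V block only
        m_new = np.maximum(m, s.max(axis=-1))
        scale = np.exp(m - m_new)            # rescale previously accumulated results
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(axis=-1)
        acc = acc * scale[:, None] + p @ v_blk
        m = m_new
    return acc / l[:, None]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_devices, chunk, d = 4, 8, 16
    q = rng.standard_normal((n_devices, chunk, d))
    k = rng.standard_normal((n_devices, chunk, d))
    v = rng.standard_normal((n_devices, chunk, d))

    # Each simulated device processes its own query chunk against all K/V blocks.
    out = np.stack([
        online_block_attention(q[i], [(k[j], v[j]) for j in range(n_devices)])
        for i in range(n_devices)
    ])

    # Reference: full (non-causal) attention computed in one shot.
    qf, kf, vf = (x.reshape(-1, d) for x in (q, k, v))
    s = qf @ kf.T / np.sqrt(d)
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    ref = (p / p.sum(axis=-1, keepdims=True)) @ vf
    assert np.allclose(out.reshape(-1, d), ref, atol=1e-6)
```

In a real multi-GPU setting the inner loop over `kv_blocks` corresponds to K/V blocks arriving from peer devices, which is where scheduling and communication/computation overlap become the dominant concerns.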