Empirical investigation of the reliability and limitations of the µ-Transfer technique for scaling hyperparameters, particularly learning rates, across transformer models of varying sizes.
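A minimal sketch of the width rule that µ-Transfer relies on, assuming the common Adam variant in which matrix-like weights have their learning rate scaled by base_width/width while embedding-like parameters keep the base rate; the function name, the `base_width` value, and the two-way layer grouping are illustrative assumptions, not the `mup` library's API.

```python
# Illustrative sketch (not the `mup` library's API) of the µ-Transfer width rule
# for Adam: learning rates for matrix-like weights shrink as base_width / width,
# while embedding-like parameters keep the base rate, so a learning rate tuned
# on a narrow proxy model can be reused at larger widths.

def mup_adam_lr(base_lr: float, width: int, base_width: int, kind: str) -> float:
    """Per-parameter-group Adam learning rate under a µP-style width rule.

    kind: "matrix" for hidden/readout weight matrices, "embedding" for
    vector-like parameters such as input embeddings and biases.
    """
    if kind == "embedding":
        return base_lr                       # unchanged as the model widens
    if kind == "matrix":
        return base_lr * base_width / width  # scaled down with the width multiplier
    raise ValueError(f"unknown parameter kind: {kind!r}")


if __name__ == "__main__":
    base_lr, base_width = 3e-3, 256          # tuned once on a small proxy model
    for width in (256, 1024, 4096):
        print(f"width={width:5d}  "
              f"matrix lr={mup_adam_lr(base_lr, width, base_width, 'matrix'):.2e}  "
              f"embedding lr={mup_adam_lr(base_lr, width, base_width, 'embedding'):.2e}")
```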
DISTFLASHATTN is a distributed attention mechanism that partitions long sequences into token chunks across multiple devices while preserving the IO-aware benefits of single-device memory-efficient attention. It introduces three key optimizations: load-balanced scheduling, overlap of communication and computation, and a rematerialization-aware gradient checkpointing strategy, to achieve high GPU utilization and low communication overhead for training long-context LLMs.
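A simplified, single-process sketch of the underlying idea of splitting attention over token chunks: each simulated "device" owns one query chunk and consumes the K/V blocks of every other chunk through a FlashAttention-style online softmax, so the full attention matrix is never materialized. It illustrates only the general blockwise scheme; it does not implement DISTFLASHATTN's load-balanced scheduling, communication/computation overlap, or rematerialization-aware checkpointing, and all names are illustrative.

```python
import numpy as np


def online_block_attention(q_chunk, kv_blocks):
    """Accumulate softmax(Q K^T / sqrt(d)) V over K/V blocks one block at a time."""
    d = q_chunk.shape[-1]
    m = np.full(q_chunk.shape[0], -np.inf)   # running row-wise max of the logits
    l = np.zeros(q_chunk.shape[0])           # running softmax denominator
    acc = np.zeros_like(q_chunk)             # running numerator (weighted sum of V)
    for k_blk, v_blk in kv_blocks:
        s = q_chunk @ k_blk.T / np.sqrt(d)   # logits against this K/V block only
        m_new = np.maximum(m, s.max(axis=-1))
        scale = np.exp(m - m_new)            # rescale previously accumulated results
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(axis=-1)
        acc = acc * scale[:, None] + p @ v_blk
        m = m_new
    return acc / l[:, None]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_devices, chunk, d = 4, 8, 16
    q = rng.standard_normal((n_devices, chunk, d))
    k = rng.standard_normal((n_devices, chunk, d))
    v = rng.standard_normal((n_devices, chunk, d))

    # Each simulated device processes its own query chunk against all K/V blocks.
    out = np.stack([
        online_block_attention(q[i], [(k[j], v[j]) for j in range(n_devices)])
        for i in range(n_devices)
    ])

    # Reference: full (non-causal) attention computed in one shot.
    qf, kf, vf = (x.reshape(-1, d) for x in (q, k, v))
    s = qf @ kf.T / np.sqrt(d)
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    ref = (p / p.sum(axis=-1, keepdims=True)) @ vf
    assert np.allclose(out.reshape(-1, d), ref, atol=1e-6)
```

In a real multi-GPU setting the inner loop over `kv_blocks` corresponds to K/V blocks arriving from peer devices, which is where scheduling and communication/computation overlap become the dominant concerns.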