Core Concepts
ATOM is a resilient distributed training framework for asynchronously training large models in a decentralized setting, designed to maximize training throughput and efficiency.
Abstract
The advent of the Transformer architecture has revolutionized natural language processing (NLP) models, leading to significant advancements. However, training large-scale models remains challenging for those without access to specialized high-end hardware. ATOM addresses this by introducing a decentralized training framework that enables asynchronous training of vast models on cost-effective hardware, such as consumer-grade GPUs connected over Ethernet. Unlike traditional methods that distribute sub-models across GPUs, ATOM accommodates a complete large language model on a single host through seamless model swapping, improving training efficiency and scalability over conventional approaches. Experiments with different GPT-3 model configurations demonstrate up to a 20× improvement in training efficiency with ATOM under suboptimal network conditions.
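To make the model-swapping idea concrete, below is a minimal sketch of one plausible interpretation: the full model resides in host (CPU) memory, and each layer is streamed through GPU memory only for the duration of its computation. The `SwappedModel` class and the serial swap-in/swap-out loop are illustrative assumptions, not ATOM's actual implementation, which would presumably overlap swapping with computation and handle the backward pass as well.

```python
import torch
import torch.nn as nn

class SwappedModel(nn.Module):
    """Illustrative sketch (not ATOM's code): the complete model is kept
    in host memory, and layers are swapped onto a single GPU one at a
    time during the forward pass."""

    def __init__(self, layers: nn.ModuleList, device: torch.device):
        super().__init__()
        self.layers = layers   # full model held in host (CPU) memory
        self.device = device   # a single consumer-grade GPU

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.to(self.device)
        for layer in self.layers:
            layer.to(self.device)  # swap the layer into GPU memory
            x = layer(x)
            layer.to("cpu")        # swap it back out, freeing GPU memory
        return x

# Usage: a toy stack of 48 feed-forward blocks standing in for a model
# too large to fit on the GPU all at once.
if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    blocks = nn.ModuleList(
        nn.Sequential(nn.Linear(1024, 1024), nn.GELU()) for _ in range(48)
    )
    model = SwappedModel(blocks, device)
    out = model(torch.randn(8, 1024))
    print(out.shape)  # torch.Size([8, 1024])
```

The trade-off this sketch serializes, and which a real scheduler would hide, is PCIe transfer time versus compute time: prefetching the next layer asynchronously while the current one executes keeps the GPU busy despite the model exceeding its memory.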
Stats
Our experiments using different GPT-3 model configurations reveal that ATOM can improve training efficiency by up to 20× compared with state-of-the-art decentralized pipeline parallelism approaches.