Core Concepts
FRUGAL is an optimization framework that trains large language models by combining state-full optimization (e.g., Adam) for a selected subset of parameters with state-free updates (e.g., signSGD) for the remaining parameters, achieving near-state-of-the-art performance with a significantly reduced memory footprint.
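A minimal PyTorch sketch of that split, assuming a toy model and a placeholder `split_params` heuristic; the actual FRUGAL selection and update rules differ in the details:

```python
import torch
import torch.nn.functional as F

def split_params(model, density=0.25):
    # Illustrative split: a fraction `density` of the trainable tensors gets the
    # state-full optimizer; the rest are updated state-free. The real FRUGAL
    # selection is more involved (e.g. subsets that can be re-drawn during
    # training), so treat this as a placeholder heuristic.
    params = [p for p in model.parameters() if p.requires_grad]
    k = max(1, int(density * len(params)))
    return params[:k], params[k:]

def signsgd_step(params, lr):
    # State-free update: move in the sign of the gradient, storing no m/v state.
    with torch.no_grad():
        for p in params:
            if p.grad is not None:
                p.add_(p.grad.sign(), alpha=-lr)

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8))
statefull, statefree = split_params(model, density=0.25)
adam = torch.optim.Adam(statefull, lr=1e-3)   # Adam state (m, v) exists only for the chosen subset

for step in range(3):                         # toy loop on random data
    x, y = torch.randn(32, 64), torch.randn(32, 8)
    loss = F.mse_loss(model(x), y)
    loss.backward()
    adam.step()                               # state-full update for the subset
    signsgd_step(statefree, lr=1e-3)          # state-free update for everything else
    model.zero_grad(set_to_none=True)
```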
Stats
Training an 8 billion parameter LLaMA model in a 16-bit format requires 16GB for storing the parameters, 16GB for gradients, and an additional 32GB for the Adam optimizer's m and v statistics, totaling 64GB of memory.
Storing master weights and optimizer statistics in 32-bit format for mixed-precision training pushes memory demands beyond the capacity of high-end accelerators such as the A100-80GB.
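A rough back-of-the-envelope check of these figures, assuming 2-byte (16-bit) and 4-byte (32-bit) storage and the common mixed-precision recipe of 16-bit weights and gradients plus 32-bit master weights and Adam statistics; the paper's exact layout may differ:

```python
N = 8e9                                    # parameters in an 8B model
GB = 1e9                                   # bytes per GB, as in the figures above

# Everything stored in 16-bit (2 bytes per value):
bf16_total = 2 * N + 2 * N + 2 * (2 * N)   # weights + gradients + Adam m and v
print(bf16_total / GB)                     # 64.0 -> 16 + 16 + 32 GB

# Mixed precision: 16-bit weights and gradients, 32-bit master weights, m, and v:
mixed_total = 2 * N + 2 * N + 4 * N + 4 * (2 * N)
print(mixed_total / GB)                    # 128.0 -> well beyond an A100-80GB
```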
In experiments, FRUGAL consistently outperformed memory-efficient baselines (GaLore, BAdam) across LLaMA model sizes (60M, 130M, 350M, 1B) with the same memory budget (density ρ = 0.25), approaching the performance of the memory-intensive Adam optimizer.
FRUGAL with zero density (ρ = 0.0), where only Embeddings, RMSNorms, and Logits are state-full, outperformed GaLore and BAdam with ρ = 0.25, indicating the potential of state-free methods for a large portion of parameters.
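A hedged illustration of what such a ρ = 0.0 split could look like, keyed on typical LLaMA parameter names (the keywords `embed`, `norm`, `lm_head` are guesses at naming conventions, not the paper's selection code); the resulting lists can be fed to the training-step sketch above:

```python
import torch

# Keep Adam state only for Embeddings, RMSNorms, and the Logits (lm_head) layer;
# every transformer-block weight falls back to signSGD.
STATEFULL_KEYWORDS = ("embed", "norm", "lm_head")

def partition_by_name(model: torch.nn.Module):
    statefull, statefree = [], []
    for name, p in model.named_parameters():
        if any(k in name.lower() for k in STATEFULL_KEYWORDS):
            statefull.append(p)   # small set of tensors that keep m and v
        else:
            statefree.append(p)   # bulk of the parameters: no optimizer state
    return statefull, statefree
```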
The Logits layer was significantly more sensitive to switching from Adam to signSGD than the Embeddings and RMSNorms, indicating that it is the module that most needs an advanced, state-full optimizer.
Quotes
"This solution allows for high-dimensional updates, which provides additional opportunities to explore the parameter space and improves convergence."
"These findings underscore the potential of state-free algorithms for updating a substantial portion of the parameter space, paving the way for efficient, scalable optimization methods that deliver high performance without the significant memory costs traditionally associated with state-of-the-art optimizers."
"The results show that our method significantly outperforms previous memory-efficient algorithms while using the same memory budget."
"We demonstrate that only the Logits layer in transformer-like models requires advanced optimizers like Adam, while other modules (including Embeddings and RMSNorms) can use simpler methods like signSGD without significant performance loss."