Bibliographic Information: Li, Z., Zhang, X., Zhong, P., Deng, Y., Razaviyayn, M., & Mirrokni, V. (2024). Addax: Utilizing Zeroth-Order Gradients to Improve Memory Efficiency and Performance of SGD for Fine-Tuning Language Models. arXiv preprint arXiv:2410.06441v1.
Research Objective: This paper introduces Addax, a novel optimization algorithm designed to address the memory challenges associated with fine-tuning large language models (LLMs). The authors aim to demonstrate that Addax can achieve comparable or superior performance to existing methods while significantly reducing memory requirements.
Methodology: Addax combines first-order and zeroth-order gradient estimates to reduce memory usage during fine-tuning. It assigns each data batch to either in-place Stochastic Gradient Descent (IP-SGD) or the Memory-Efficient Zeroth-order Optimizer (MeZO) based on input sequence length: shorter sequences, whose backpropagated gradients require less activation memory, are assigned to IP-SGD, while longer, more memory-hungry sequences are handled by MeZO, which needs no activation storage. This lets Addax keep a memory footprint close to MeZO's while benefiting from the faster convergence of IP-SGD. A sketch of the resulting update is given below.
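To make the batch routing and combined update concrete, here is a minimal PyTorch sketch of the idea under stated assumptions: the function names (`zo_gradient_estimate`, `addax_step`), the toy linear model, the default hyperparameter values, and the exact way the mixing weight `alpha` enters the update are illustrative choices, not the authors' reference implementation, and the assignment of batches by sequence length is assumed to happen upstream of `addax_step`.

```python
import torch

def zo_gradient_estimate(model, loss_fn, batch, eps=1e-3):
    """MeZO-style two-point (SPSA) estimate: perturb all parameters along a
    seeded random direction, evaluate the loss twice, and reuse the seed to
    regenerate the perturbation instead of storing it (keeps memory flat)."""
    seed = torch.randint(0, 2**31 - 1, (1,)).item()

    def perturb(scale):
        torch.manual_seed(seed)          # same seed -> same direction z every call
        for p in model.parameters():
            p.data.add_(scale * eps * torch.randn_like(p))

    with torch.no_grad():
        perturb(+1.0)
        loss_plus = loss_fn(model, batch)
        perturb(-2.0)                    # net effect: -eps * z
        loss_minus = loss_fn(model, batch)
        perturb(+1.0)                    # restore original parameters
    projected_grad = (loss_plus - loss_minus) / (2 * eps)
    return seed, projected_grad

def addax_step(model, loss_fn, batch_short, batch_long,
               lr=1e-5, alpha=0.5, eps=1e-3):
    """One Addax-style step: backprop on the short-sequence batch (IP-SGD part),
    zeroth-order estimate on the long-sequence batch (MeZO part), then an
    in-place update mixing the two with weight `alpha` (illustrative; the
    paper's exact weighting may differ)."""
    # First-order part: standard backprop, affordable because sequences are short.
    model.zero_grad()
    loss_fn(model, batch_short).backward()

    # Zeroth-order part: no activations stored, so long sequences fit in memory.
    seed, proj_grad = zo_gradient_estimate(model, loss_fn, batch_long, eps)

    # In-place parameter update combining both gradient estimates.
    torch.manual_seed(seed)              # regenerate the same perturbation z
    with torch.no_grad():
        for p in model.parameters():
            z = torch.randn_like(p)
            g_fo = p.grad if p.grad is not None else torch.zeros_like(p)
            p.data.add_(-lr * ((1 - alpha) * g_fo + alpha * proj_grad * z))

# Toy usage with a linear model standing in for the LLM; the two batches are
# assumed to have been routed upstream by input sequence length.
model = torch.nn.Linear(8, 1)
loss_fn = lambda m, b: torch.nn.functional.mse_loss(m(b[0]), b[1])
batch_short = (torch.randn(4, 8), torch.randn(4, 1))
batch_long = (torch.randn(4, 8), torch.randn(4, 1))
addax_step(model, loss_fn, batch_short, batch_long)
```

The seed-replay trick in `zo_gradient_estimate` mirrors MeZO's memory-saving idea: the random perturbation direction is never materialized persistently, only its seed is kept and replayed.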
Key Findings: The authors conducted extensive experiments on various LLM architectures (masked LM and autoregressive LM), model scales (350M to 70B parameters), and tasks (classification, multiple-choice questions, content generation). Their results demonstrate that Addax consistently outperforms MeZO in terms of accuracy and convergence speed while maintaining a comparable memory footprint. Notably, Addax successfully fine-tuned the OPT-13B model on a single A100 GPU across all nine tasks, while SGD and IP-SGD failed due to memory limitations. Furthermore, Addax achieved superior performance compared to MeZO, SGD, and IP-SGD when fine-tuning larger models like OPT-30B, OPT-66B, and Llama-2-70B.
Main Conclusions: Addax offers a practical and efficient solution for fine-tuning LLMs, particularly in resource-constrained settings. Its ability to balance memory efficiency, convergence speed, and model performance makes it a valuable tool for researchers and practitioners working with large-scale language models.
Significance: This research addresses a crucial bottleneck in LLM fine-tuning, enabling wider accessibility to advanced language processing capabilities. By reducing the hardware requirements for fine-tuning, Addax facilitates research and development in the field, particularly for individuals and institutions with limited resources.
Limitations and Future Research: Addax introduces a new hyperparameter (α) that requires tuning for optimal performance, and the authors note that the regularization effect of zeroth-order gradient estimates on first-order methods warrants further investigation. Future research could explore applying Addax to pre-training tasks and combining it with other memory-efficient techniques such as PEFT, quantization, and memory-efficient attention mechanisms.