Bibliographic Information: Han, A., Li, J., Huang, W., Hong, M., Takeda, A., Jawanpuria, P., & Mishra, B. (2024). SLTrain: a sparse plus low-rank approach for parameter and memory efficient pretraining. Advances in Neural Information Processing Systems, 37.
Research Objective: This paper proposes a novel method, SLTrain, to address the computational and memory demands of pretraining large language models (LLMs) by leveraging sparse and low-rank matrix representations.
Methodology: SLTrain parameterizes each weight matrix as the sum of a low-rank component (modeled via matrix factorization) and a sparse component whose support is chosen randomly once and kept fixed, so that only the values on that support are trained. This reduces both the number of trainable parameters and the memory consumed during training. The authors pretrain LLaMA models ranging from 60M to 7B parameters on the C4 dataset and compare SLTrain against full-rank training and low-rank pretraining methods such as ReLoRA and GaLore, reporting perplexity, parameter count, and memory usage.
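To make the parameterization concrete, below is a minimal PyTorch sketch of a linear layer whose weight is the sum of a trainable low-rank product and a sparse term on a fixed random support, in the spirit of SLTrain. The class name, initialization scheme, and the default rank/sparsity values are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn


class SparsePlusLowRankLinear(nn.Module):
    """Sketch of a sparse-plus-low-rank linear layer (illustrative, not the official SLTrain code).

    The weight is parameterized as W = B @ A + S, where B (out x r) and A (r x in)
    form the low-rank factor and S is a sparse matrix whose support is sampled
    uniformly at random once and then kept fixed; only the values on that support
    are trained.
    """

    def __init__(self, in_features: int, out_features: int, rank: int = 64,
                 sparsity: float = 0.03):
        super().__init__()
        # Low-rank factors, trained jointly (B starts at zero so the low-rank
        # contribution is initially zero; this initialization is an assumption).
        self.A = nn.Parameter(torch.randn(rank, in_features) / rank ** 0.5)
        self.B = nn.Parameter(torch.zeros(out_features, rank))

        # Fixed random support for the sparse factor: sample nnz flat indices once.
        nnz = int(sparsity * in_features * out_features)
        idx = torch.randperm(in_features * out_features)[:nnz]
        self.register_buffer("sparse_idx", idx)            # support: not trained
        self.sparse_vals = nn.Parameter(torch.zeros(nnz))   # values: trained

    def weight(self) -> torch.Tensor:
        # Scatter the trained values onto the fixed support, then add the low-rank part.
        S = torch.zeros(self.B.shape[0] * self.A.shape[1],
                        device=self.A.device, dtype=self.A.dtype)
        S[self.sparse_idx] = self.sparse_vals
        return self.B @ self.A + S.view(self.B.shape[0], self.A.shape[1])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight().t()
```

As a rough illustration of the savings with these assumed values, a 4096×4096 weight with rank 128 and 3% density trains about 2·4096·128 ≈ 1.05M low-rank parameters plus ≈ 0.50M sparse values, versus ≈ 16.8M for the dense weight, roughly a 10× reduction in trainable parameters for that layer.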
Key Findings: SLTrain achieves perplexity comparable to full-rank training and GaLore while significantly reducing parameter count and memory cost. For instance, SLTrain reduces memory requirements by up to 73% when pretraining the LLaMA 7B model compared to full-rank training.
Main Conclusions: Combining sparse and low-rank matrix factorization is a viable strategy for efficient LLM pretraining. SLTrain demonstrates that this approach can achieve competitive performance with significantly reduced resource requirements, making it suitable for training larger models on less powerful hardware.
Significance: This research contributes to the field of LLM pretraining by offering a practical way around its memory and computational bottlenecks. SLTrain's efficiency has the potential to democratize access to LLM training and facilitate the development of even larger and more powerful language models.
Limitations and Future Research: While SLTrain demonstrates promising results, further investigation into dynamic support learning for the sparse factor and theoretical analysis of convergence and loss landscape could further enhance its performance and understanding. Exploring its applicability to other model architectures, such as vision and multi-modal foundation models, is also a promising avenue for future research.