The paper presents a two-stage approach, "LLM-Shearing", for efficiently producing smaller yet competitive large language models (LLMs) by leveraging existing pre-trained models.
In the first stage, the authors propose "targeted structured pruning" to prune a larger source model (LLaMA2-7B) down to a specified target architecture, such as that of Pythia-1.4B or INCITE-Base-3B. The pruning objective learns a set of masks that scale down different model substructures (layers, attention heads, hidden dimensions, intermediate dimensions) in an end-to-end manner, so that the pruned model matches the target architecture exactly.
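To make the mask-learning idea concrete, here is a minimal sketch of structured pruning with learnable masks. It is not the authors' implementation (the paper uses hard-concrete masks with Lagrangian constraints over several substructure types); the class name `MaskedMLP` and the simple squared penalty are illustrative assumptions.

```python
# Minimal sketch: learnable masks over the intermediate dimensions of an MLP block,
# trained jointly with the language-modeling loss plus a sparsity penalty.
import torch
import torch.nn as nn

class MaskedMLP(nn.Module):
    """Feed-forward block whose intermediate dimensions can be pruned via a
    learnable mask (soft during training, binarized/removed afterwards)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)
        # One logit per intermediate dimension; sigmoid gives a soft keep-probability.
        self.mask_logits = nn.Parameter(torch.zeros(d_ff))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = torch.sigmoid(self.mask_logits)          # soft mask in (0, 1)
        return self.down(torch.relu(self.up(x)) * z)

    def sparsity_penalty(self, target_ratio: float) -> torch.Tensor:
        # Penalize deviation of the expected kept fraction from the target ratio,
        # a simplified stand-in for the paper's Lagrangian constraint.
        kept = torch.sigmoid(self.mask_logits).mean()
        return (kept - target_ratio) ** 2

# Usage: add the penalty to the language-modeling loss while learning the masks.
mlp = MaskedMLP(d_model=512, d_ff=2048)
x = torch.randn(4, 16, 512)
lm_loss = mlp(x).pow(2).mean()                       # placeholder for the real LM loss
loss = lm_loss + 1.0 * mlp.sparsity_penalty(target_ratio=0.25)
loss.backward()
```

After training, dimensions whose mask values fall below a threshold are dropped outright, yielding a smaller dense model rather than a sparse one.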
In the second stage, the authors continue pre-training the pruned model with a dynamic batch loading algorithm. This algorithm adjusts the proportion of data drawn from each domain (e.g., GitHub, C4) during training according to how quickly the loss in that domain is being reduced, which makes more efficient use of the pre-training data.
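A minimal sketch of the re-weighting idea follows: domains whose loss is still far from a per-domain reference loss receive a larger share of the next batches. The reference losses, domain names, and exponential update rule below are simplified assumptions, not the paper's exact implementation.

```python
# Minimal sketch of dynamic batch loading: up-weight domains that lag their reference loss.
import math

def update_domain_weights(ref_loss, cur_loss, base_weights):
    """Re-weight domains by how far the current loss exceeds the reference loss."""
    excess = {d: max(cur_loss[d] - ref_loss[d], 0.0) for d in base_weights}
    unnorm = {d: base_weights[d] * math.exp(excess[d]) for d in base_weights}
    total = sum(unnorm.values())
    return {d: w / total for d, w in unnorm.items()}

# Example with made-up numbers: GitHub has already reached its reference loss,
# while C4 still lags, so C4 receives a larger share of upcoming batches.
base = {"github": 0.05, "c4": 0.15, "wikipedia": 0.80}
ref  = {"github": 0.70, "c4": 2.10, "wikipedia": 1.80}
cur  = {"github": 0.68, "c4": 2.40, "wikipedia": 1.85}
print(update_domain_weights(ref, cur, base))
```

The update runs periodically during continued pre-training, so no separate proxy run is needed to tune the data mixture.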
The authors demonstrate the effectiveness of their approach by producing two Sheared-LLaMA models, with 1.3B and 2.7B parameters. Despite using only 50 billion tokens (5% of OpenLLaMA's pre-training budget) for pruning and continued pre-training, the Sheared-LLaMA models outperform other open-source models of similar size on a wide range of downstream tasks and in instruction tuning. The authors also provide an analysis of the importance of dynamic batch loading and comparisons with other pruning approaches.