The paper presents "LLM-shearing", a two-stage approach for efficiently producing smaller yet competitive large language models (LLMs) from existing pre-trained models.
In the first stage, the authors propose "targeted structured pruning", which prunes a larger source model (LLaMA2-7B) down to a specified target architecture, such as that of Pythia-1.4B or INCITE-Base-3B. The method learns a set of masks that scale down different model substructures (layers, attention heads, hidden and intermediate dimensions) in an end-to-end manner, optimizing the language-modeling loss jointly with constraints that push the pruned model toward the target shape.
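A minimal sketch of this idea, assuming a PyTorch setting: learnable mask logits over attention heads are relaxed with a sigmoid (the paper uses hard-concrete masks and richer constraint terms), and a Lagrange multiplier pushes the expected number of kept heads toward a target count. The names (`MaskedHeads`, `target_heads`) and the stand-in task loss are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: learn head masks under a size constraint (not the authors' code).
import torch
import torch.nn as nn

class MaskedHeads(nn.Module):
    """Scales each attention head's output by a learnable mask in [0, 1]."""
    def __init__(self, num_heads: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_heads))  # mask parameters

    def forward(self, head_outputs: torch.Tensor) -> torch.Tensor:
        # head_outputs: (batch, num_heads, dim); relaxed masks scale each head
        z = torch.sigmoid(self.logits)
        return head_outputs * z.view(1, -1, 1)

    def expected_size(self) -> torch.Tensor:
        # Differentiable estimate of how many heads are kept
        return torch.sigmoid(self.logits).sum()

num_heads, target_heads = 32, 20
masks = MaskedHeads(num_heads)
lam = nn.Parameter(torch.zeros(()))            # Lagrange multiplier

opt_masks = torch.optim.AdamW(masks.parameters(), lr=1e-2)
opt_lam = torch.optim.SGD([lam], lr=1e-2, maximize=True)  # ascent on lambda

for step in range(200):
    head_outputs = torch.randn(4, num_heads, 64)           # stand-in activations
    # Stand-in "task" loss that prefers keeping heads; the real objective is the LM loss.
    task_loss = (masks(head_outputs) - head_outputs).pow(2).mean()
    constraint = masks.expected_size() - target_heads       # > 0 while over budget
    loss = task_loss + lam * constraint                     # min over masks, max over lambda
    opt_masks.zero_grad(); opt_lam.zero_grad()
    loss.backward()
    opt_masks.step(); opt_lam.step()

# Heads whose mask ends up near zero are removed, yielding the target architecture.
```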
In the second stage, the authors continue pre-training the pruned model with a dynamic batch loading algorithm. This algorithm adjusts the proportions of data drawn from different domains (e.g., GitHub, C4) during training based on each domain's rate of loss reduction, making more efficient use of the pre-training data.
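A minimal sketch of this reweighting, assuming the RedPajama domains used in the paper: domains whose current loss still sits above a per-domain reference loss are upweighted exponentially, and the weights are then renormalized. The loss values, the temperature parameter, and the `update_weights` helper are illustrative assumptions, not the paper's exact numbers or code.

```python
# Sketch only: dynamic batch loading as exponential reweighting of lagging domains.
import numpy as np

domains = ["CommonCrawl", "C4", "GitHub", "Wikipedia", "Books", "ArXiv", "StackExchange"]
weights = np.full(len(domains), 1.0 / len(domains))              # start from uniform
reference_loss = np.array([1.9, 2.0, 0.7, 1.7, 1.9, 1.4, 1.6])   # illustrative targets

def update_weights(weights, current_loss, reference_loss, temperature=1.0):
    # Excess loss: how far each domain still is from its reference target.
    excess = np.maximum(current_loss - reference_loss, 0.0)
    # Upweight domains that lag behind, then renormalize to a distribution.
    new = weights * np.exp(excess / temperature)
    return new / new.sum()

current_loss = np.array([2.1, 2.3, 0.9, 1.8, 2.2, 1.5, 1.9])     # e.g. from a held-out set
weights = update_weights(weights, current_loss, reference_loss)
print(dict(zip(domains, np.round(weights, 3))))
```

Repeating this update periodically during continued pre-training shifts the sampling mixture toward domains where the pruned model has the most ground left to recover.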
The authors demonstrate the effectiveness of their approach by producing two models, Sheared-LLaMA-1.3B and Sheared-LLaMA-2.7B. Despite using only 50 billion tokens (5% of OpenLLaMA's pre-training budget) for pruning and continued pre-training, the Sheared-LLaMA models outperform other open-source models of similar size on a wide range of downstream tasks and in instruction tuning. The authors also analyze the importance of dynamic batch loading and compare their method with other pruning approaches.
Source: Mengzhou Xia et al., https://arxiv.org/pdf/2310.06694.pdf