
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

Core Concepts
Structured pruning followed by continued pre-training can produce smaller yet competitive large language models at a fraction of the compute required to train them from scratch.
The paper presents a two-stage approach, "LLM-shearing", to efficiently produce smaller yet competitive large language models (LLMs) from existing pre-trained models. In the first stage, "targeted structured pruning" prunes a larger source model (LLaMA2-7B) down to a specified target architecture, such as that of Pythia-1.4B or INCITE-Base-3B. This pruning learns a set of masks that scale down different model substructures (layers, heads, hidden dimensions, etc.) in an end-to-end manner.

In the second stage, the pruned model continues pre-training with a dynamic batch loading algorithm, which adjusts the proportions of data drawn from different domains (e.g., GitHub, C4) based on per-domain loss reduction rates, ensuring efficient use of the pre-training data.

The authors demonstrate the approach by producing two Sheared-LLaMA models, at 1.3B and 2.7B parameters. Despite using only 50 billion tokens (5% of OpenLLaMA's pre-training budget) for pruning and continued pre-training, the Sheared-LLaMA models outperform other open-source models of similar size on a wide range of downstream tasks and after instruction tuning. The authors also analyze the importance of dynamic batch loading and compare against other pruning approaches.
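The mask-learning idea in the first stage can be illustrated with a toy sketch (not the paper's implementation; the function names, mask values, and hard threshold below are hypothetical). Each substructure, here an attention head, is scaled by a learned mask during pruning, and substructures whose masks fall below a threshold are removed to reach the target architecture:

```python
# Toy sketch of mask-based structured pruning. In the paper, masks are
# learned end-to-end with constraints that push them toward the target
# architecture; here they are just fixed illustrative values.

def apply_head_masks(head_outputs, masks):
    """Scale each attention head's output by its (learned) mask value."""
    assert len(head_outputs) == len(masks)
    return [[x * m for x in head] for head, m in zip(head_outputs, masks)]

def prune_heads(head_outputs, masks, threshold=0.5):
    """Hard-prune: keep only heads whose mask survives the threshold."""
    return [head for head, m in zip(head_outputs, masks) if m >= threshold]

heads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 heads, dim 2 (toy values)
masks = [1.0, 0.25, 0.75]                      # hypothetical learned masks

scaled = apply_head_masks(heads, masks)  # soft scaling during pruning
kept = prune_heads(heads, masks)         # hard removal at the end
print(len(kept))  # 2 heads survive
```

During training the masks stay continuous so gradients can flow; only at the end of pruning are sub-threshold structures physically removed.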
For comparison: LLaMA2-7B was pre-trained on 2 trillion tokens, while the Sheared-LLaMA models use only 50 billion tokens from the RedPajama dataset. Among the baselines, OpenLLaMA-3B-v1 and OpenLLaMA-3B-v2 were pre-trained on 1 trillion tokens each; Pythia and OPT on 300 billion each; INCITE-Base on 800 billion; and TinyLlama on 3 trillion.
"Sheared-LLaMA-1.3B outperforms TinyLlama-1.1B, despite TinyLlama-1.1B being pre-trained on 3T tokens—more than the total data used for pre-training LLAMA2-7B and our pruning process combined."

"Dynamic batch loading loads more data from the Book and C4 subsets, indicating that these domains are more challenging for a pruned model to recover."
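The dynamic batch loading rule behind the second quote can be sketched as a multiplicative-weights update (a simplified sketch; in the paper, the per-domain reference losses come from fitted scaling laws and updates happen periodically during training). Domains whose current loss remains furthest above its reference get sampled more:

```python
import math

def update_domain_weights(weights, current_loss, reference_loss):
    """One dynamic-batch-loading update: upweight domains whose loss is
    still far above the reference, then renormalize to a distribution."""
    diff = {d: max(current_loss[d] - reference_loss[d], 0.0) for d in weights}
    unnorm = {d: weights[d] * math.exp(diff[d]) for d in weights}
    total = sum(unnorm.values())
    return {d: w / total for d, w in unnorm.items()}

# Hypothetical losses: Book and C4 are furthest from their references,
# matching the paper's observation that these domains are hardest to recover.
weights = {"C4": 0.25, "GitHub": 0.25, "Book": 0.25, "Wikipedia": 0.25}
loss = {"C4": 3.2, "GitHub": 1.0, "Book": 3.6, "Wikipedia": 2.0}
ref  = {"C4": 2.8, "GitHub": 1.1, "Book": 2.9, "Wikipedia": 2.0}

new_w = update_domain_weights(weights, loss, ref)
# Book and C4 are upweighted; GitHub and Wikipedia shrink after normalization.
```

Because the update is multiplicative and renormalized, domains that have already reached their reference loss gradually give up probability mass to those still lagging behind.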

Key Insights Distilled From

Sheared LLaMA, by Mengzhou Xia et al. (04-12-2024)

Deeper Inquiries

How can the LLM-shearing approach be extended to handle specialized domains or tasks that are not well-covered in the pre-training data?

The LLM-shearing approach can be extended to specialized domains or tasks by adding a domain-specific fine-tuning stage. Even when the pre-training data does not cover a domain comprehensively, structured pruning can still be applied to create a smaller LLM; the pruned model can then be fine-tuned on domain-specific datasets, allowing it to acquire domain knowledge and improve performance in the areas the pre-training data covers poorly.

What are the potential limitations of the structured pruning approach, and how can it be further improved to maintain model performance at higher compression rates?

One potential limitation of the structured pruning approach is the risk of performance degradation as the compression rate increases: the more parameters are pruned, the higher the likelihood of losing critical information and model capacity. To maintain performance at higher compression rates, several improvements can be implemented:

- Fine-grained pruning: selectively prune less important parameters while preserving essential information, helping maintain performance even at higher compression rates.
- Dynamic pruning strategies: adaptively adjust the pruning process based on the model's performance during training; by pruning different parts of the model at different stages, the compression process can be optimized while preserving performance.
- Regularization techniques: apply sparsity-inducing penalties during pruning to prevent over-pruning and ensure that important parameters are retained.
- Iterative pruning: prune the model gradually in multiple stages, with fine-tuning and retraining between steps, giving the model opportunities to recover from performance drops.
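The iterative pruning idea above can be sketched in a few lines (illustrative only; `prune_step`, the per-stage fraction, and the `retrain` hook are hypothetical, and real structured pruning removes whole substructures rather than individual weights):

```python
def prune_step(weights, frac):
    """Zero the smallest-magnitude `frac` of the currently nonzero weights."""
    nonzero = sorted(abs(w) for w in weights if w != 0.0)
    k = int(len(nonzero) * frac)
    if k == 0:
        return list(weights)
    cutoff = nonzero[k - 1]
    return [0.0 if (w != 0.0 and abs(w) <= cutoff) else w for w in weights]

def iterative_prune(weights, frac_per_stage, stages, retrain=None):
    """Prune gradually over several stages, optionally retraining between
    cuts so the model can recover (retrain is a no-op in this sketch)."""
    for _ in range(stages):
        weights = prune_step(weights, frac_per_stage)
        if retrain is not None:
            weights = retrain(weights)
    return weights

w = [0.1, -0.5, 2.0, -0.05, 0.8, 3.0, -1.2, 0.02, 0.3, -0.9]
pruned = iterative_prune(w, frac_per_stage=0.2, stages=3)
print(pruned.count(0.0))  # 4 weights zeroed across the 3 stages
```

Pruning 20% of the *remaining* weights per stage removes progressively fewer weights each round, which is gentler than cutting the full target sparsity in one shot.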

Given the promising results on smaller LLMs, how could this approach be scaled up to produce competitive large-scale LLMs that rival the performance of models trained from scratch on massive datasets?

To scale up the LLM-shearing approach so that it produces large-scale LLMs rivaling models trained from scratch on massive datasets, several strategies can be employed:

- Increased compute budget: allocate more compute to the pruning and continued pre-training stages so that larger source models can be handled effectively.
- Domain-specific pre-training data: incorporate diverse, extensive, and domain-specific datasets to improve generalization and performance in specialized areas.
- Ensemble methods: combine multiple pruned models so that their aggregated strengths yield a more robust and diverse system.
- Transfer learning: transfer knowledge from pre-trained models to larger LLMs, leveraging the representations preserved during structured pruning.

By combining these strategies and optimizing the approach for larger models, LLM-shearing could be scaled up to produce competitive large-scale LLMs that rival models trained from scratch on massive datasets.
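The ensemble idea above can be illustrated with a minimal sketch (hypothetical names and toy logits; ensembling real LLMs would average next-token distributions rather than three-class scores): average each pruned model's softmax probabilities, then take the argmax:

```python
import math

def softmax(logits):
    """Convert raw scores to a probability distribution."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_predict(per_model_logits):
    """Average the class probabilities of several models; return the argmax."""
    probs = [softmax(l) for l in per_model_logits]
    n = len(probs)
    avg = [sum(p[i] for p in probs) / n for i in range(len(probs[0]))]
    return max(range(len(avg)), key=avg.__getitem__)

# Two hypothetical pruned models scoring 3 classes:
model_a = [2.0, 0.5, 0.0]  # confident in class 0
model_b = [1.9, 2.0, 0.0]  # slightly prefers class 1
print(ensemble_predict([model_a, model_b]))  # the confident vote wins: class 0
```

Averaging probabilities rather than taking a majority vote lets a confident model outweigh a near-tie, which is part of why ensembles of diverse pruned models can be more robust than any single member.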