Core Concepts
State-of-the-art language models are compressed using data-driven methods to enable efficient deployment.
Abstract
The paper introduces the LLM Surgeon framework for compressing large language models through data-driven methods. It explores compression as an alternative to training smaller models from scratch, covering structured, semi-structured, and unstructured pruning. The method improves weight updates by considering correlations between weights and achieves state-of-the-art results in pruning large language models. Key highlights include:
Introduction of LLM Surgeon framework for compression.
Exploration of structured, semi-structured, and unstructured pruning methods.
Improvement in weight updates by considering correlations between weights.
Achieving state-of-the-art results in large language model pruning.
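To make the three pruning regimes concrete, the sketch below contrasts unstructured pruning (zeroing individual low-magnitude weights) with structured pruning (removing whole columns, as LLM Surgeon does for rows and columns). This is a minimal magnitude-based illustration on a toy matrix, not the paper's correlation-aware method; all names and shapes here are hypothetical.

```python
import numpy as np

# Hypothetical weight matrix standing in for one layer of a language model.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))

def prune_unstructured(W, sparsity):
    """Unstructured pruning: zero out the individually smallest-magnitude weights."""
    k = int(W.size * sparsity)
    threshold = np.sort(np.abs(W), axis=None)[k]
    return np.where(np.abs(W) < threshold, 0.0, W)

def prune_columns(W, n_remove):
    """Structured pruning: drop whole columns ranked by L2 norm.
    The matrix actually shrinks, so speedups need no sparse kernels."""
    norms = np.linalg.norm(W, axis=0)
    keep = np.sort(np.argsort(norms)[n_remove:])
    return W[:, keep]

W_unstructured = prune_unstructured(W, sparsity=0.5)  # same shape, half zeros
W_structured = prune_columns(W, n_remove=2)           # smaller shape (8, 6)
```

Semi-structured pruning (e.g. 2:4 sparsity) sits between the two: small fixed patterns of zeros within each group of weights, which hardware can exploit while retaining more flexibility than whole-column removal.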
Stats
State-of-the-art language models are becoming increasingly large in an effort to achieve the highest performance on large corpora of available textual data.
We provide a general framework for unstructured, semi-structured and structured pruning and improve upon weight updates to capture more correlations between weights.
Experimentally, our method can prune rows and columns from a range of OPT models and Llamav2-7B by 20%-30%, with a negligible loss in performance.
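The "weight updates that capture correlations between weights" build on the classic Optimal Brain Surgeon idea: after removing a weight, the remaining, correlated weights are adjusted via the inverse curvature matrix to absorb the error. The sketch below shows that generic OBS step on a toy problem; it is an assumption-laden illustration (the curvature proxy, sizes, and variable names are invented here), not the paper's exact curvature estimate or update schedule.

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=5)                        # toy weight vector for one layer
X = rng.normal(size=(100, 5))                 # toy input activations
H = X.T @ X / X.shape[0] + 1e-4 * np.eye(5)   # proxy curvature from input correlations
H_inv = np.linalg.inv(H)

# Optimal Brain Surgeon step: prune the weight whose removal costs the
# least loss, then update the *remaining* weights with the inverse
# Hessian so correlated weights compensate for the removed one.
scores = w**2 / np.diag(H_inv)                # saliency of each weight
q = int(np.argmin(scores))                    # cheapest weight to remove
delta = -(w[q] / H_inv[q, q]) * H_inv[:, q]   # correlated update for all weights
w_pruned = w + delta                          # w_pruned[q] is driven to zero
```

The key point is that `delta` touches every coordinate, not just the pruned one: because `H_inv[:, q]` encodes how weight `q` correlates with the others, the surviving weights shift to compensate, which is what a naive zero-out-and-keep update misses.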