Core Concepts
Data-driven compression framework for large language models.
Abstract
Introduction to the deployment challenges posed by the sheer size of large language models.
Exploration of data-driven compression as an alternative to training smaller models.
Framework for unstructured, semi-structured, and structured pruning of LLMs (the three granularities are illustrated in the first sketch after this list).
Comparison with classical methods such as Optimal Brain Damage and Optimal Brain Surgeon (their key expressions are restated after this list).
Multi-shot pruning schedule with interleaved low-rank first-order corrections (see the schedule sketch after this list).
Results showing state-of-the-art performance in structured, semi-structured, and unstructured compression.
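For concreteness, the sketch below illustrates the three pruning granularities on a single weight matrix. Magnitude scores stand in for the curvature-based scores the paper actually uses, and all function names here are hypothetical.

```python
import torch

def unstructured_mask(W: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero the lowest-scoring individual elements (arbitrary pattern)."""
    k = max(1, int(sparsity * W.numel()))
    threshold = W.abs().flatten().kthvalue(k).values
    return (W.abs() > threshold).float()

def semi_structured_mask(W: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n largest-magnitude weights in each group of m (e.g. 2:4).

    Assumes W.numel() is divisible by m.
    """
    groups = W.abs().reshape(-1, m)
    mask = torch.zeros_like(groups)
    mask.scatter_(1, groups.topk(n, dim=1).indices, 1.0)
    return mask.reshape(W.shape)

def structured_mask(W: torch.Tensor, n_cols: int) -> torch.Tensor:
    """Drop the n_cols lowest-norm columns; the matrix can then shrink."""
    keep = W.norm(dim=0).topk(W.shape[1] - n_cols).indices
    mask = torch.zeros_like(W)
    mask[:, keep] = 1.0
    return mask

W = torch.randn(8, 8)
print(unstructured_mask(W, 0.5).mean())   # ~0.5 of elements kept
print(semi_structured_mask(W).mean())     # exactly 0.5 with 2:4
print(structured_mask(W, 2).mean())       # 6 of 8 columns kept
```

Only the structured mask allows the pruned matrix to be physically shrunk, which is why structured pruning translates directly into memory and latency savings on standard hardware.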
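As background on these two baselines: Optimal Brain Damage scores weights using a diagonal Hessian approximation, while Optimal Brain Surgeon uses the full inverse Hessian. In a local quadratic model of the loss, removing weight $w_q$ with an optimal compensating update of the remaining weights gives the classic OBS expressions:

```latex
\delta \mathbf{w} = -\frac{w_q}{\left[\mathbf{H}^{-1}\right]_{qq}}\,\mathbf{H}^{-1}\mathbf{e}_q ,
\qquad
\Delta \mathcal{L} = \frac{w_q^{2}}{2\left[\mathbf{H}^{-1}\right]_{qq}} ,
```

where $\mathbf{e}_q$ is the unit vector selecting weight $q$; weights with the smallest cost $\Delta\mathcal{L}$ are removed first. Making these curvature-based updates tractable at LLM scale is what motivates the Kronecker-factorized approximation quoted below.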
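A minimal sketch of a multi-shot schedule, assuming a simple linear ramp toward the target sparsity; the ramp shape and the stub functions are illustrative assumptions, not the paper's exact procedure.

```python
def multi_shot_schedule(target_sparsity: float, num_shots: int) -> list[float]:
    """Cumulative sparsity level after each shot, ending at the target."""
    return [target_sparsity * t / num_shots for t in range(1, num_shots + 1)]

def prune_to(model, sparsity: float) -> None:
    """Stub: curvature-based pruning up to the given sparsity level."""

def low_rank_correction(model) -> None:
    """Stub: interleaved low-rank first-order update of remaining weights."""

def compress(model, target_sparsity: float, num_shots: int = 4):
    # Pruning in several small shots keeps each step closer to the regime
    # where the local quadratic approximation of the loss is accurate.
    for sparsity in multi_shot_schedule(target_sparsity, num_shots):
        prune_to(model, sparsity)
        low_rank_correction(model)
    return model

print(multi_shot_schedule(0.3, 4))  # ≈ [0.075, 0.15, 0.225, 0.3]
```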
Stats
Structured compression (removing entire rows and columns)
Unstructured compression (removing individual matrix elements)
Model sizes evaluated: 1.3B, 2.7B, and 6.7B parameters
Quotes
"The superior performance of LLM Surgeon is achieved by scaling up the block-diagonal Kronecker-factorized approximations to the empirical Fisher from Eigendamage to LLMs."
"Our method gives the first practically usable results for structured pruning of LLMs – they can be pruned by up to 30% with minor performance degradation."