The paper proposes a novel approach to aggressively compress extremely large language models (LLMs) by developing a layer-wise sparsity scheduler. The key contributions are:
Providing a formal explanation for why the "sequentially pruning all" assumption is effective when precomputing the inverses of the Hessian matrices, which is crucial for accelerating the pruning process.
Deriving an estimate of the layer-wise pruning loss from the inverse of the Hessian matrix, which is used to guide sparsity allocation across layers.
Employing log-level clustering of the estimated errors to control the sparsity distribution, achieving high sparsity levels (>0.7) with reasonable perplexity (see the first sketch after this list).
Demonstrating the effectiveness of the proposed method through experiments on large language models like OPT-66B and BLOOM-176B, consistently outperforming the state-of-the-art SparseGPT technique.
Showing that the method is compatible with quantization techniques that convert FP16 weights to INT4, enabling further compression of LLMs (see the second sketch below).
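The sketch below illustrates the general idea behind the scheduler: estimate a per-layer pruning loss from the inverse Hessian, then group layers by log-level clustering of those estimates to assign per-layer sparsity. It assumes the OBS-style saliency w^2 / [H^-1]_qq used by SparseGPT and a simple order-of-magnitude clustering rule; the paper's exact loss estimate and allocation policy may differ, and the helper names here are hypothetical.

```python
import numpy as np

def estimated_layer_loss(W, X, damp=1e-2):
    """Estimate the pruning loss of one linear layer.

    W: (out_features, in_features) weight matrix.
    X: (in_features, n_samples) calibration activations feeding this layer.
    """
    H = X @ X.T                                             # Hessian proxy of the layer-wise reconstruction loss
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])    # dampening for numerical stability
    H_inv = np.linalg.inv(H)
    # OBS-style saliency of each weight: w^2 / [H^-1]_qq; summing it gives a
    # rough estimate of the error incurred by pruning this layer.
    saliency = (W ** 2) / np.diag(H_inv)[None, :]
    return float(saliency.sum())

def allocate_sparsity(layer_errors, base_sparsity=0.7, step=0.05, cap=0.95):
    """Log-level clustering: layers whose estimated errors share an order of
    magnitude form one cluster; lower-error clusters are pruned harder."""
    levels = np.floor(np.log10(np.asarray(layer_errors, dtype=float)))
    rank = {lv: r for r, lv in enumerate(sorted(set(levels)))}   # rank 0 = lowest-error cluster
    n = len(rank)
    # Illustrative allocation rule (an assumption): start from the target
    # sparsity and add `step` for every log level below the noisiest cluster.
    return [min(cap, base_sparsity + step * (n - 1 - rank[lv])) for lv in levels]
```

Layers whose estimated error falls in a lower log-level cluster are pruned more aggressively, which is how a scheduler of this kind can push overall sparsity above 0.7 while keeping perplexity under control.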
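For the quantization compatibility claim, the weights that survive pruning can be further quantized from FP16 to INT4. The sketch below shows group-wise symmetric INT4 quantization as one plausible instantiation; the group size, the symmetric scheme, and the function names are assumptions for illustration, not the paper's specific quantizer.

```python
import numpy as np

def quantize_int4(W, group_size=128):
    """Group-wise symmetric quantization of an FP16 weight matrix to signed INT4.

    W: (out_features, in_features) with in_features divisible by group_size.
    Returns INT4 codes (stored in int8 here) and one FP16 scale per group.
    Pruned (zero) weights stay exactly zero, so sparsity is preserved.
    """
    out, d = W.shape
    groups = W.reshape(out, d // group_size, group_size).astype(np.float32)
    scale = np.abs(groups).max(axis=-1, keepdims=True) / 7.0   # map max magnitude to 7
    scale = np.where(scale == 0, 1.0, scale)                   # keep fully pruned groups valid
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q.reshape(out, d), scale.astype(np.float16)

def dequantize_int4(q, scale, group_size=128):
    """Recover an approximate FP16 weight matrix from INT4 codes and scales."""
    out, d = q.shape
    groups = q.reshape(out, d // group_size, group_size).astype(np.float32)
    return (groups * scale.astype(np.float32)).reshape(out, d).astype(np.float16)
```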
The paper addresses the challenge of deploying extremely large language models on personal computers and mobile devices by developing an efficient post-training compression approach that maintains model performance while significantly reducing the model size and complexity.
Key insights distilled from the paper by Zining Zhang et al. (arxiv.org, 10-01-2024): https://arxiv.org/pdf/2409.20094.pdf