Core Concepts
The paper proposes a Hessian sensitivity-aware mixed-sparsity pruning method that efficiently reduces the size of Large Language Models (LLMs) to at least 50% sparsity without retraining. Sparsity is allocated adaptively according to sensitivity, while the overall sparsity is kept at the target level.
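A minimal sketch of the allocation idea, assuming per-layer sensitivity scores are already available (the function name, sparsity range, and rescaling step below are illustrative assumptions, not the paper's exact procedure): less sensitive layers receive higher sparsity, and the per-layer budgets are rescaled so the parameter-weighted average matches the global target.

```python
import numpy as np

def allocate_sparsity(sensitivities, n_params, target_sparsity=0.5,
                      min_sparsity=0.3, max_sparsity=0.7):
    """Assign each layer a sparsity that decreases with its sensitivity,
    then rescale so the parameter-weighted average equals the global target."""
    sens = np.asarray(sensitivities, dtype=float)
    n = np.asarray(n_params, dtype=float)

    # Less sensitive layers can tolerate more pruning.
    inv = 1.0 / (sens + 1e-12)
    raw = min_sparsity + (max_sparsity - min_sparsity) * (
        (inv - inv.min()) / (inv.max() - inv.min() + 1e-12))

    # Rescale so the overall (parameter-weighted) sparsity matches the target.
    overall = np.sum(raw * n) / np.sum(n)
    return np.clip(raw * target_sparsity / overall, 0.0, 0.99)

# Toy example: four layers with made-up sensitivities and parameter counts.
print(allocate_sparsity([2.0, 0.5, 1.0, 4.0], n_params=[1e6, 1e6, 2e6, 2e6]))
```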
Abstract
The paper addresses the deployment challenges posed by the large size of GPT-family Large Language Models (LLMs) and introduces an efficient model compression method based on mixed-sparsity pruning. By combining Hessian-based saliency criteria with sensitivity-aware sparsity allocation, the approach reduces the error introduced by pruning while maintaining high compression ratios. The method is compatible with quantization and achieves state-of-the-art results in LLM pruning.
The paper surveys mainstream model compression techniques, including knowledge distillation, model quantization, and sparsity pruning. It examines saliency criteria such as OBD (Optimal Brain Damage) and OBS (Optimal Brain Surgeon), highlighting the importance of error compensation, i.e., adjusting the remaining weights to offset the error introduced by pruning a weight (see the sketch below).
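As a rough illustration of these classic criteria (a sketch of the standard OBD/OBS formulas under a layer-wise quadratic approximation, using toy values; not the paper's implementation): OBD scores each weight with a diagonal second-order term, while OBS uses the inverse Hessian both to score a weight and to compensate the remaining weights for the error its removal introduces.

```python
import numpy as np

def obd_saliency(w, H_diag):
    """OBD: score each weight with a diagonal second-order term, 0.5 * h_qq * w_q^2."""
    return 0.5 * H_diag * w ** 2

def obs_prune_one(w, H_inv, q):
    """OBS: saliency of weight q plus the compensation update that adjusts the
    remaining weights to minimize the quadratic error of removing it."""
    saliency = 0.5 * w[q] ** 2 / H_inv[q, q]
    delta = -(w[q] / H_inv[q, q]) * H_inv[:, q]   # error-compensation step
    w_new = w + delta
    w_new[q] = 0.0                                # the pruned weight is exactly zero
    return saliency, w_new

# Toy example with three weights and a made-up Hessian.
w = np.array([0.8, -0.1, 0.4])
H = np.array([[2.0, 0.2, 0.0],
              [0.2, 1.5, 0.1],
              [0.0, 0.1, 1.0]])
print(obd_saliency(w, np.diag(H)))
print(obs_prune_one(w, np.linalg.inv(H), q=1))
```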
Experiments evaluate the method on LLaMA-7B, LLaMA2-7B, and Baichuan-7B across several datasets. The mixed-sparsity approach outperforms SparseGPT in both perplexity and zero-shot downstream task accuracy. Applying mixed-sparsity pruning jointly with quantization further increases the compression ratio while keeping performance degradation small.
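To give a feel for how pruning and quantization compose (a generic sketch only, using simple magnitude pruning and symmetric integer quantization as stand-ins for the paper's Hessian-based criterion and its joint scheme):

```python
import numpy as np

def prune_then_quantize(w, sparsity=0.5, n_bits=4):
    """Magnitude-prune a weight matrix, then symmetrically quantize the
    surviving weights to n_bits signed integers."""
    k = int(sparsity * w.size)                        # number of weights to drop
    order = np.argsort(np.abs(w), axis=None)          # ascending by magnitude
    mask = np.ones(w.size, dtype=bool)
    mask[order[:k]] = False                           # drop the k smallest magnitudes
    mask = mask.reshape(w.shape)

    q_max = 2 ** (n_bits - 1) - 1
    scale = np.abs(w[mask]).max() / q_max if mask.any() else 1.0
    codes = np.clip(np.round(w / scale), -q_max - 1, q_max)
    return (codes * scale) * mask, mask, scale        # dequantized sparse weights

w = np.random.default_rng(0).normal(size=(4, 4))
w_hat, mask, scale = prune_then_quantize(w, sparsity=0.5, n_bits=4)
print(f"kept fraction: {mask.mean():.2f}, scale: {scale:.4f}")
```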
Overall, the paper presents a framework for efficient compression of Large Language Models through sensitivity-aware mixed-sparsity pruning based on Hessian information.
Stats
Various Large Language Models (LLMs) have achieved outstanding performance.
The proposed method prunes LLMs to at least 50% sparsity without retraining.
Model quantization replaces high-precision floating-point parameters with lower-precision representations.
Second-order derivatives are used for network pruning.
Weight-level mixed sparsity allocation is implemented.
Sensitivity is derived from the trace of the Hessian matrix (see the sketch after this list).
Different models exhibit varying sensitivities across layers.
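Computing the exact trace of each layer's Hessian is typically impractical for LLMs; a common workaround is Hutchinson's stochastic estimator. The sketch below illustrates that idea (the `hvp` callable and the toy check are assumptions for illustration, not the paper's code); layers with a larger Hessian trace would be treated as more sensitive and pruned less aggressively.

```python
import numpy as np

def hutchinson_trace(hvp, dim, n_samples=64, rng=None):
    """Estimate tr(H) without materializing H: for Rademacher vectors v,
    E[v^T H v] = tr(H), so averaging v^T (H v) over samples approximates the trace."""
    rng = np.random.default_rng() if rng is None else rng
    total = 0.0
    for _ in range(n_samples):
        v = rng.choice([-1.0, 1.0], size=dim)   # Rademacher probe vector
        total += v @ hvp(v)                     # one Hessian-vector product
    return total / n_samples

# Toy check against an explicit (diagonal) Hessian.
H = np.diag([3.0, 1.0, 0.5, 2.5])
print(hutchinson_trace(lambda v: H @ v, dim=4), "vs exact", np.trace(H))
```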
Quotes
"The advantages of the proposed method exhibit even more when the sparsity is extremely high."
"Our approach beats SparseGPT in terms of both perplexity and zero-shot downstream task performances."
"Our method further reduces performance degradation caused by pruning."