This paper introduces a novel pruning technique called Blockwise Parameter-Efficient Sparsity Allocation (BESA) for compressing large language models (LLMs). The key insights are:
BESA operates under a blockwise pruning framework: it minimizes the reconstruction error of each transformer block's output rather than the layer-wise error targeted by typical pruning methods, which mitigates the accumulation of pruning error across layers.
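To make the block-wise objective concrete, here is a minimal PyTorch sketch (an illustration under assumed names such as `dense_block`, `pruned_block`, and `calib_inputs`, not the authors' released code):

```python
import torch
import torch.nn.functional as F

def blockwise_reconstruction_loss(dense_block, pruned_block, calib_inputs):
    """Block-wise objective: match the pruned block's output to the
    frozen dense block's output on calibration activations."""
    with torch.no_grad():
        target = dense_block(calib_inputs)   # frozen dense reference
    output = pruned_block(calib_inputs)      # gradients reach sparsity params
    return F.mse_loss(output, target)
```

Blocks are then handled sequentially: once a block is pruned, its outputs serve as calibration inputs for the next block, so errors are corrected locally rather than compounding across the network.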
BESA employs a parameter-efficient sparsity learning algorithm to optimize the pruning rate of each layer. It expresses each layer's sparsity as a differentiable combination of candidate ratios weighted by learnable pruning probabilities, enabling efficient and effective pruning of LLMs.
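A hedged sketch of this parameterization follows; the candidate ratio set, the importance metric, and the class name are illustrative assumptions rather than the paper's exact formulation. Each candidate pruning ratio gets a learnable probability, and the applied mask is their probability-weighted mixture, so gradients flow back to the pruning probabilities:

```python
import torch

class LearnableSparsity(torch.nn.Module):
    """Sketch: per-layer sparsity as a differentiable, probability-weighted
    mixture of candidate pruning ratios (candidates and metric are assumed)."""
    def __init__(self, candidates=(0.3, 0.4, 0.5, 0.6, 0.7)):
        super().__init__()
        self.candidates = candidates
        self.logits = torch.nn.Parameter(torch.zeros(len(candidates)))

    def forward(self, weight, importance):
        probs = torch.softmax(self.logits, dim=0)   # learnable pruning probabilities
        order = importance.flatten().argsort()      # least important weights first
        n = weight.numel()
        mask = weight.new_zeros(n)
        for p, ratio in zip(probs, self.candidates):
            m = weight.new_ones(n)
            m[order[: int(n * ratio)]] = 0.0        # drop the bottom fraction
            mask = mask + p * m                     # differentiable mixture of masks
        return weight * mask.view_as(weight)
```

This is parameter-efficient because each layer adds only `len(candidates)` learnable logits rather than one parameter per weight; after training, the highest-probability candidate can be rounded to a hard mask.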
BESA can be optimized jointly with weight quantization techniques, further increasing the compression ratio while maintaining model performance.
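As a rough illustration of the joint setup (a generic uniform quantizer of my own choosing, not necessarily the scheme the paper pairs with BESA), fake quantization and the learned sparsity mask can share one differentiable forward pass, so the same block-wise loss drives both:

```python
import torch

def fake_quantize(weight, n_bits=4):
    """Uniform asymmetric fake quantization (a generic stand-in)."""
    qmax = 2 ** n_bits - 1
    scale = (weight.max() - weight.min()).clamp(min=1e-8) / qmax
    zero = torch.round(-weight.min() / scale)
    q = (torch.round(weight / scale) + zero).clamp(0, qmax)
    return (q - zero) * scale

# Inside a block's forward pass, both compressions apply to the weight, e.g.
#   w_hat = sparsity_module(fake_quantize(w), importance)
# Gradients still reach the sparsity logits through the mask mixture; the
# round() itself is non-differentiable and would need a straight-through
# estimator if the quantizer's parameters were also learned.
```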
Experiments show that BESA achieves state-of-the-art performance in pruning various LLMs, such as LLaMA and LLaMA2, at up to 50% sparsity. It can prune a 70B-parameter LLM in just 5 hours on a single A100 GPU and outperforms prior methods such as SparseGPT and Wanda in both perplexity and zero-shot accuracy. The paper also demonstrates the practical speedup of the pruned models using a hardware simulator.
Key insights distilled from: Peng Xu, Wenq..., arxiv.org, 04-22-2024
https://arxiv.org/pdf/2402.16880.pdf