Key Concepts
This paper introduces a post-training pruning method for foundation models that directly formulates and solves the Multiple Removal Problem (MRP) for layer-wise pruning. By pruning multiple weights simultaneously, it achieves higher accuracy than existing techniques without any retraining.
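The paper's exact derivation is not reproduced in this summary; the sketch below assumes the standard OBS-style layer-wise setup that SparseGPT builds on, and only illustrates how single-weight removal (SRP) differs from simultaneous multiple-weight removal (MRP). The symbols H, Q, and ε are introduced here for illustration and are not taken from the paper.

```latex
% Per-layer reconstruction objective (W: dense layer weights, X: calibration
% inputs, M: binary sparsity mask, \widehat{W}: updated surviving weights):
\[
  \min_{M,\;\widehat{W}} \;\bigl\lVert W X \;-\; (M \odot \widehat{W})\,X \bigr\rVert_F^2,
  \qquad H = X X^{\top}.
\]
% SRP (single removal, OBS/SparseGPT-style): remove one weight w_q at a time,
% with per-weight error increase
\[
  \varepsilon_q \;=\; \frac{w_q^{2}}{2\,[H^{-1}]_{qq}}.
\]
% MRP (multiple removal): remove a whole index set Q in one step, with a joint
% error term and a single compensating update of the surviving weights:
\[
  \varepsilon_Q \;=\; \tfrac{1}{2}\, w_Q^{\top}\bigl([H^{-1}]_{QQ}\bigr)^{-1} w_Q .
\]
```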
Statistics
Under 2:4 sparsity on LLaMA2-70B, the proposed method achieves 4.278 perplexity on WikiText2, compared to 5.698 for SparseGPT.
The average accuracy on zero-shot datasets for Mamba-790M with 50% sparsity is 51.095% using the proposed method, compared to 50.555% with SparseGPT.
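For context on the 2:4 pattern cited above: a 2:4-sparse layer keeps at most 2 non-zero weights in every group of 4 consecutive weights. The snippet below is a minimal illustrative sketch using magnitude-based selection; it is not the paper's method or SparseGPT, both of which choose the removed weights via reconstruction-error criteria.

```python
import numpy as np

def prune_2_4(weights: np.ndarray) -> np.ndarray:
    """Zero out 2 of every 4 consecutive weights along the last axis,
    keeping the 2 with the largest magnitude in each group.

    Illustrative only: magnitude-based selection stands in for the
    error-based criteria used by SparseGPT and MRP-style methods.
    Assumes the total number of weights is a multiple of 4.
    """
    flat = weights.copy().reshape(-1, 4)            # groups of 4 weights
    drop = np.argsort(np.abs(flat), axis=1)[:, :2]  # 2 smallest magnitudes per group
    np.put_along_axis(flat, drop, 0.0, axis=1)      # zero them out
    return flat.reshape(weights.shape)

# Example: one row of 8 weights -> two groups of 4, each keeps its 2 largest entries
w = np.array([[0.9, -0.1, 0.4, 0.05, -0.7, 0.2, 0.0, 0.3]])
print(prune_2_4(w))
# [[ 0.9  0.   0.4  0.  -0.7  0.   0.   0.3]]
```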
Quotes
"The traditional pruning techniques, which fine-tune or retrain models on full datasets for many epochs (i.e., pruning-aware training), are too expensive for LLMs in terms of data and GPU resources."
"Different from the SRP-based SparseGPT, we directly formulate the MRP for layer-wise LLM pruning to simultaneously prune multiple weights in LLMs."
"Our comprehensive experiments across various LLM families (based on transformers and Mamba), model sizes, and datasets demonstrate our superior performance compared with the optimization-based SparseGPT and other heuristic SOTA baselines."