One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models
The authors propose a Hessian sensitivity-aware mixed-sparsity pruning method that compresses Large Language Models (LLMs) in one shot, without any retraining, to at least 50% sparsity. Instead of pruning every layer uniformly, the method allocates sparsity adaptively across layers according to their Hessian-based sensitivity, pruning insensitive layers more aggressively and sensitive layers more conservatively, while keeping the overall sparsity ratio at the global target.
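To make the allocation idea concrete, here is a minimal sketch, not the paper's actual algorithm: it scores each layer with an OBS-style salience (w_i^2 * H_ii), maps lower sensitivity to higher sparsity within a band around the target, renormalizes so the mean sparsity equals the target, and then masks the lowest-salience weights in one shot. The helper names (`layer_sensitivity`, `allocate_sparsity`, `prune_layer`) and the `spread` parameter are hypothetical, and the Hessian diagonals are assumed to be precomputed from calibration data.

```python
import torch

def layer_sensitivity(weight: torch.Tensor, hessian_diag: torch.Tensor) -> float:
    # OBS-style salience w_i^2 * H_ii, averaged over the layer:
    # higher means pruning this layer hurts the loss more.
    return float((weight.pow(2) * hessian_diag).mean())

def allocate_sparsity(sensitivities, target=0.5, spread=0.2):
    # Map sensitivity rank to a sparsity in [target - spread/2, target + spread/2];
    # the least sensitive layer gets the most pruning. Shift the allocation so
    # its mean equals the global target (clamping can perturb this slightly).
    s = torch.tensor(sensitivities, dtype=torch.float32)
    ranks = (s - s.min()) / (s.max() - s.min() + 1e-12)  # 0 = least sensitive
    sparsities = target + spread * (0.5 - ranks)
    sparsities = sparsities + (target - sparsities.mean())  # keep mean = target
    return sparsities.clamp(0.0, 1.0).tolist()

def prune_layer(weight: torch.Tensor, hessian_diag: torch.Tensor, sparsity: float):
    # One-shot pruning: zero the fraction `sparsity` of weights with the
    # lowest salience. No retraining step follows.
    k = int(sparsity * weight.numel())
    if k == 0:
        return weight.clone()
    salience = (weight.pow(2) * hessian_diag).flatten()
    threshold = salience.kthvalue(k).values
    mask = (salience > threshold).reshape(weight.shape)
    return weight * mask

if __name__ == "__main__":
    torch.manual_seed(0)
    layers = [torch.randn(64, 64) for _ in range(4)]
    # Stand-in Hessian diagonals; a real pipeline would estimate these
    # from calibration data rather than sampling them at random.
    hessians = [torch.rand(64, 64) + 0.1 for _ in range(4)]
    sens = [layer_sensitivity(w, h) for w, h in zip(layers, hessians)]
    sparsities = allocate_sparsity(sens, target=0.5, spread=0.2)
    pruned = [prune_layer(w, h, s) for w, h, s in zip(layers, hessians, sparsities)]
    overall = sum((p == 0).sum().item() for p in pruned) / sum(p.numel() for p in pruned)
    print("per-layer sparsity:", [round(s, 3) for s in sparsities])
    print(f"overall sparsity:   {overall:.3f}")
```

The sketch captures only the budget-allocation principle: per-layer sparsity varies with sensitivity while the mean is held at the target, so the model-wide compression goal is met even though no single layer is forced to the uniform rate.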