A novel layer-wise sparsity scheduler that uses per-weight pruning-error estimates derived from the inverse of the Hessian matrix to reach high sparsity levels (>0.7) in very large language models while maintaining reasonable perplexity.
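A minimal sketch of the idea, assuming an Optimal-Brain-Surgeon-style saliency of the form w^2 / [H^{-1}]_{jj} and a simple heuristic that prunes low-error layers harder; the function names, the clipping bound, and the allocation rule are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def obs_pruning_errors(W: np.ndarray, H_inv: np.ndarray) -> np.ndarray:
    """Per-weight pruning-error estimate w_ij^2 / (2 * [H^{-1}]_jj)."""
    diag = np.diag(H_inv)            # [H^{-1}]_jj for each input dimension j
    return (W ** 2) / (2.0 * diag)   # broadcasts over the rows of W

def allocate_layer_sparsity(layer_errors, target_sparsity=0.7):
    """Heuristic layer-wise schedule: layers whose weights are cheap to prune
    (low mean estimated error) receive above-target sparsity, expensive layers
    receive below-target sparsity. Does not strictly enforce the global mean."""
    mean_err = np.array([e.mean() for e in layer_errors])
    weights = 1.0 / (mean_err + 1e-12)   # invert: cheaper layers -> prune harder
    weights /= weights.mean()            # normalise around 1.0
    return np.clip(target_sparsity * weights, 0.0, 0.95)
```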
Compression techniques like Magnitude Pruning, SparseGPT, and Wanda can significantly reduce the size of large language models, but their impact on downstream task performance varies. While these methods can maintain near-baseline perplexity, they exhibit substantial degradation in instruction-following capabilities, highlighting the limitations of perplexity as the sole evaluation metric. Jensen-Shannon Divergence is proposed as a more comprehensive metric that captures the nuanced changes in model behavior after compression.
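A sketch of how such a metric could be computed, assuming access to next-token logits from the dense and compressed models on the same inputs; the function name and tensor shapes are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def jensen_shannon_divergence(logits_p: torch.Tensor,
                              logits_q: torch.Tensor,
                              eps: float = 1e-12) -> torch.Tensor:
    """Mean token-level JSD between two models' next-token distributions.

    logits_p, logits_q: (batch, seq_len, vocab) logits from the dense and
    compressed models on the same token positions.
    """
    p = F.softmax(logits_p, dim=-1)
    q = F.softmax(logits_q, dim=-1)
    m = 0.5 * (p + q)
    kl_pm = (p * (torch.log(p + eps) - torch.log(m + eps))).sum(dim=-1)
    kl_qm = (q * (torch.log(q + eps) - torch.log(m + eps))).sum(dim=-1)
    jsd = 0.5 * (kl_pm + kl_qm)      # in nats, bounded above by log 2
    return jsd.mean()                # average over batch and positions
```

Unlike perplexity, which only scores the probability assigned to the reference tokens, this compares the full output distributions of the two models, so shifts in behavior that leave the reference-token likelihood roughly intact still register.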
The AQLM algorithm extends Additive Quantization (AQ) to enable extreme compression of large language models, achieving state-of-the-art accuracy at 2-3 bits per parameter.
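A hedged sketch of the additive-quantization idea underlying AQLM: each group of d consecutive weights is approximated by the sum of M codewords, one drawn from each of M codebooks. The greedy encoder below and the codebook shapes are illustrative simplifications; AQLM itself optimizes codes and codebooks jointly (e.g., with beam search on calibration data).

```python
import numpy as np

def encode_group_greedy(w: np.ndarray, codebooks) -> list:
    """Pick one codeword per codebook that greedily minimises the residual."""
    residual = w.copy()
    codes = []
    for C in codebooks:                               # C: (codebook_size, d)
        idx = int(np.argmin(((residual - C) ** 2).sum(axis=1)))
        codes.append(idx)
        residual = residual - C[idx]
    return codes

def decode_group(codes, codebooks) -> np.ndarray:
    """Reconstruct the weight group as the sum of the selected codewords."""
    return sum(C[i] for i, C in zip(codes, codebooks))

# Toy usage: groups of d=8 weights, M=2 codebooks with 256 entries each.
# Two 8-bit indices per 8 weights = 2 bits per parameter (excluding codebooks).
rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((256, 8)) for _ in range(2)]
w = rng.standard_normal(8)
codes = encode_group_greedy(w, codebooks)
w_hat = decode_group(codes, codebooks)
```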