A novel layer-wise sparsity scheduler that uses per-weight pruning-error estimates derived from the inverse of the Hessian matrix to reach high sparsity levels (>0.7) in very large language models while maintaining reasonable perplexity.
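A minimal sketch of the idea, assuming an Optimal-Brain-Surgeon-style saliency of the form w^2 / [H^{-1}]_{jj} and a simple heuristic that prunes low-error layers harder; the function names, the clipping bound, and the allocation rule are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def obs_pruning_errors(W: np.ndarray, H_inv: np.ndarray) -> np.ndarray:
    """Per-weight pruning-error estimate w_ij^2 / (2 * [H^{-1}]_jj)."""
    diag = np.diag(H_inv)            # [H^{-1}]_jj for each input dimension j
    return (W ** 2) / (2.0 * diag)   # broadcasts over the rows of W

def allocate_layer_sparsity(layer_errors, target_sparsity=0.7):
    """Heuristic layer-wise schedule: layers whose weights are cheap to prune
    (low mean estimated error) receive above-target sparsity, expensive layers
    receive below-target sparsity. Does not strictly enforce the global mean."""
    mean_err = np.array([e.mean() for e in layer_errors])
    weights = 1.0 / (mean_err + 1e-12)   # invert: cheaper layers -> prune harder
    weights /= weights.mean()            # normalise around 1.0
    return np.clip(target_sparsity * weights, 0.0, 0.95)
```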
Compression techniques like Magnitude Pruning, SparseGPT, and Wanda can significantly reduce the size of large language models, but their impact on downstream task performance varies. While these methods can maintain near-baseline perplexity, they exhibit substantial degradation in instruction-following capabilities, highlighting the limitations of perplexity as the sole evaluation metric. Jensen-Shannon Divergence is proposed as a more comprehensive metric that captures the nuanced changes in model behavior after compression.
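A sketch of how such a metric could be computed, assuming access to next-token logits from the dense and compressed models on the same inputs; the function name and tensor shapes are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def jensen_shannon_divergence(logits_p: torch.Tensor,
                              logits_q: torch.Tensor,
                              eps: float = 1e-12) -> torch.Tensor:
    """Mean token-level JSD between two models' next-token distributions.

    logits_p, logits_q: (batch, seq_len, vocab) logits from the dense and
    compressed models on the same token positions.
    """
    p = F.softmax(logits_p, dim=-1)
    q = F.softmax(logits_q, dim=-1)
    m = 0.5 * (p + q)
    kl_pm = (p * (torch.log(p + eps) - torch.log(m + eps))).sum(dim=-1)
    kl_qm = (q * (torch.log(q + eps) - torch.log(m + eps))).sum(dim=-1)
    jsd = 0.5 * (kl_pm + kl_qm)      # in nats, bounded above by log 2
    return jsd.mean()                # average over batch and positions
```

Unlike perplexity, which only scores the probability assigned to the reference tokens, this compares the full output distributions of the two models, so shifts in behavior that leave the reference-token likelihood roughly intact still register.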
The AQLM algorithm extends Additive Quantization (AQ) to enable extreme compression of large language models, achieving state-of-the-art accuracy at 2-3 bits per parameter.
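A hedged sketch of the additive-quantization idea underlying AQLM: each group of d consecutive weights is approximated by the sum of M codewords, one drawn from each of M codebooks. The greedy encoder below and the codebook shapes are illustrative simplifications; AQLM itself optimizes codes and codebooks jointly (e.g., with beam search on calibration data).

```python
import numpy as np

def encode_group_greedy(w: np.ndarray, codebooks) -> list:
    """Pick one codeword per codebook that greedily minimises the residual."""
    residual = w.copy()
    codes = []
    for C in codebooks:                               # C: (codebook_size, d)
        idx = int(np.argmin(((residual - C) ** 2).sum(axis=1)))
        codes.append(idx)
        residual = residual - C[idx]
    return codes

def decode_group(codes, codebooks) -> np.ndarray:
    """Reconstruct the weight group as the sum of the selected codewords."""
    return sum(C[i] for i, C in zip(codes, codebooks))

# Toy usage: groups of d=8 weights, M=2 codebooks with 256 entries each.
# Two 8-bit indices per 8 weights = 2 bits per parameter (excluding codebooks).
rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((256, 8)) for _ in range(2)]
w = rng.standard_normal(8)
codes = encode_group_greedy(w, codebooks)
w_hat = decode_group(codes, codebooks)
```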