Core Concepts
SVD-LLM introduces a truncation-aware data whitening strategy and a layer-wise closed-form update of the remaining model parameters to improve SVD-based LLM compression.
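The core idea can be sketched as follows: instead of truncating the SVD of a weight matrix W directly, the weight is first multiplied by a whitening matrix S derived from the input activations (a Cholesky factor of the activation Gram matrix), so that truncating singular values of WS corresponds to minimizing the actual output error. This is a minimal NumPy sketch under that reading; the function name, shapes, and damping term are illustrative assumptions, not the authors' code.

```python
import numpy as np

def whitened_svd_compress(W, X, rank):
    """Truncation-aware SVD compression sketch.

    W: weight matrix, shape (out_dim, in_dim)
    X: calibration inputs, shape (in_dim, n_samples)
    Returns low-rank factors A (out_dim, rank) and B (rank, in_dim)
    such that A @ B approximates W in a data-aware sense.
    """
    in_dim = X.shape[0]
    # Whitening matrix: Cholesky factor of the activation Gram matrix.
    # Small diagonal damping (an assumption here) keeps the factor well-conditioned.
    S = np.linalg.cholesky(X @ X.T + 1e-6 * np.eye(in_dim))
    # SVD of the whitened weight; truncating here bounds the output error.
    U, s, Vt = np.linalg.svd(W @ S, full_matrices=False)
    A = U[:, :rank] * s[:rank]            # absorb singular values
    B = Vt[:rank] @ np.linalg.inv(S)      # undo the whitening
    return A, B
```

At full rank the factors reconstruct W exactly (since S cancels with its inverse); the compression comes from choosing rank well below min(out_dim, in_dim), storing two thin matrices instead of one dense one.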
Abstract
SVD-LLM proposes techniques that address limitations of existing SVD-based LLM compression methods. It demonstrates superior performance across a range of datasets and models, especially at high compression ratios. Beyond compressing the model weights, the method also reduces the KV-cache footprint at inference time.
Stats
"Our results demonstrate the superiority of SVD-LLM over state-of-the-arts, especially at high model compression ratios."
"Specifically, when compressing LLaMA-7B under 20% compression ratio on an A100 GPU, ASVD takes about 5.5 hours whereas SVD-LLM completes the compression process in 15 minutes."
"Under 30% compression ratio, compared to the best-performing baseline (ASVD), SVD-LLM reduces the perplexity on WikiText-2, PTB, and C4 by 81%, 62%, and 39%, respectively."