Core Concepts
Reevaluating SoTA Compression Methods for LLMs Beyond Perplexity Metrics.
Summary
The paper examines the challenges and successes of compressing Large Language Models (LLMs) through pruning and quantization, and introduces the Knowledge-Intensive Compressed LLM Benchmark (LLM-KICK) to redefine evaluation protocols for compressed models. The study reports how well compressed LLMs retain their capabilities in language understanding, reasoning, generation, in-context retrieval, and summarization, evaluating across different task settings with datasets such as FreebaseQA, the MMLU benchmark, TriviaQA, and CNN/DailyMail. Results show the impact of compression on knowledge retention, the performance drop at varying sparsity levels, and the importance of calibration samples in preserving performance. The study also compares large-sparse models with small-dense ones and explores the role of calibration data in improving compression algorithms.
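To make the compression setting concrete, the following is a minimal sketch of one-shot unstructured magnitude pruning, one of the training-free baselines commonly evaluated in this line of work, applied to a Hugging Face causal LM. The model name `facebook/opt-125m` and the 50% sparsity target are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch (illustrative, not the paper's exact setup): one-shot
# layer-wise magnitude pruning of a causal LM to a target unstructured sparsity.
import torch
from transformers import AutoModelForCausalLM

def magnitude_prune_(model: torch.nn.Module, sparsity: float) -> None:
    """Zero out the smallest-magnitude weights in every Linear layer."""
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            w = module.weight.data
            k = int(w.numel() * sparsity)              # weights to drop in this layer
            if k == 0:
                continue
            threshold = w.abs().flatten().kthvalue(k).values
            w.mul_((w.abs() > threshold).to(w.dtype))  # apply the binary mask in place

# Assumed model and sparsity level; any Hugging Face causal LM works the same way.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", torch_dtype=torch.float32)
magnitude_prune_(model, sparsity=0.5)                  # e.g. 50% unstructured sparsity
```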
Statistics
Recent work has shown success in training-free and data-free compression of LLMs, achieving 50-60% sparsity.
Pruning methods suffer significant performance degradation even at trivial sparsity ratios like 25-30%.
Quantization methods are more successful than pruning.
Compressed LLMs fail to generate knowledge-enriched answers despite being fluent.
Pruned LLMs remain robust in context retrieval systems even at ≥50% sparsity.
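These degradation figures are measured on knowledge-intensive tasks rather than on perplexity alone. For contrast, here is a hedged sketch of the standard perplexity measurement the paper argues is insufficient by itself, assuming a Hugging Face causal LM and tokenizer.

```python
# Hedged sketch: standard perplexity of a causal LM on a held-out text.
# A small perplexity change after compression does not by itself imply that
# the knowledge probed by LLM-KICK-style tasks has been preserved.
import math
import torch

@torch.no_grad()
def perplexity(model, tokenizer, text: str, max_len: int = 1024) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids[:, :max_len]
    out = model(input_ids=ids, labels=ids)   # HF shifts labels internally
    return math.exp(out.loss.item())         # exp(mean token cross-entropy)
```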
Quotes
"Perplexity has been widely questioned as an unsatisfactory measure to compare the true potential of LLMs."
"Compression significantly impacts the knowledge encoded in LLMs during pre-training."
"Calibration sample count plays a vital role in preserving SparseGPT's performance during compression."