Core Concepts
Recent advances in model compression and system-level optimization make Large Language Model (LLM) inference more efficient.
Abstract
This survey paper explores existing methods that aim to make LLMs more efficient through model compression as well as through system-level optimizations. The authors provide an empirical analysis of well-known compression methods for LLaMA(/2)-7B models under a standardized setup, offering practical insights for efficient LLM deployment.
The key highlights and insights from the survey are:
Model Compression Techniques:
Architecture Pruning: Methods like LLM-Pruner, LoRAPrune, and FLaP demonstrate the potential of structured pruning to achieve compression while maintaining performance (pruning sketch below).
Quantization: Techniques like LLM.int8(), SmoothQuant, QLoRA, and GPTQ enable significant model size reduction with minimal accuracy loss (quantization sketch below).
Knowledge Distillation: Approaches like Generalized KD, TED, and DISCO leverage knowledge transfer from larger teacher models to train smaller student models (distillation-loss sketch below).
Low-rank Approximations: Methods like TensorGPT and LoSparse explore low-rank decomposition to compress LLMs (SVD sketch below).
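To make these families concrete, the sketches below are minimal, hedged illustrations rather than the cited methods' actual algorithms or APIs; all function and variable names are made up for illustration. First, structured pruning of a single linear layer using a simple L2-norm channel-importance criterion (LLM-Pruner and FLaP rely on more elaborate dependency-aware and fluctuation-based criteria):

```python
import torch
import torch.nn as nn

def prune_linear_output_channels(layer: nn.Linear, keep_ratio: float = 0.5) -> nn.Linear:
    """Toy structured pruning: drop the output channels with the lowest L2 norm.

    Illustrative only; the surveyed methods use more sophisticated, coupled
    importance criteria across attention heads and MLP groups.
    """
    with torch.no_grad():
        importance = layer.weight.norm(p=2, dim=1)              # one score per output channel
        n_keep = max(1, int(keep_ratio * layer.out_features))
        keep_idx = torch.topk(importance, n_keep).indices.sort().values

        pruned = nn.Linear(layer.in_features, n_keep, bias=layer.bias is not None)
        pruned.weight.copy_(layer.weight[keep_idx])
        if layer.bias is not None:
            pruned.bias.copy_(layer.bias[keep_idx])
    return pruned

# Usage: shrink a 4096 -> 11008 MLP projection to 50% of its output channels.
layer = nn.Linear(4096, 11008)
pruned = prune_linear_output_channels(layer, keep_ratio=0.5)
print(pruned)   # Linear(in_features=4096, out_features=5504, bias=True)
```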
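A similarly minimal sketch of weight-only quantization: symmetric per-channel int8 rounding. GPTQ and OmniQuant go further by minimizing layer-wise reconstruction error and learning clipping parameters, which this toy round-trip omits:

```python
import torch

def quantize_per_channel_int8(weight: torch.Tensor):
    """Symmetric per-output-channel int8 quantization of a weight matrix."""
    scale = weight.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 127.0  # one scale per row
    q = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)
q, s = quantize_per_channel_int8(w)
print((dequantize(q, s) - w).abs().max())   # small round-trip error
# int8 storage is 4x smaller than fp32, plus one fp scale per output channel.
```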
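For distillation, here is a sketch of the standard temperature-scaled soft-target loss that student training typically starts from; Generalized KD, TED, and DISCO add their own objectives on top of, or in place of, this basic recipe:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of soft-target KL (teacher -> student) and hard-label cross-entropy.

    student_logits, teacher_logits: (batch, seq, vocab); labels: (batch, seq).
    """
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
        log_target=True,
    ) * (T * T)
    ce = F.cross_entropy(student_logits.flatten(0, 1), labels.flatten())
    return alpha * kd + (1 - alpha) * ce

# Usage with dummy tensors standing in for teacher and student outputs:
s = torch.randn(2, 16, 32000, requires_grad=True)
t = torch.randn(2, 16, 32000)
y = torch.randint(0, 32000, (2, 16))
distillation_loss(s, t, y).backward()
```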
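Finally, a sketch of the low-rank idea: replace a dense weight matrix with two thin factors obtained from a truncated SVD. TensorGPT (tensor-train decomposition of embeddings) and LoSparse (low-rank plus sparse residual) are more elaborate instances of the same principle:

```python
import torch

def low_rank_factorize(weight: torch.Tensor, rank: int):
    """Approximate W (m x n) as A @ B with A: (m x r) and B: (r x n) via truncated SVD."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]      # absorb singular values into the left factor
    B = Vh[:rank, :]
    return A, B

W = torch.randn(4096, 4096)
A, B = low_rank_factorize(W, rank=512)
# Parameters: 4096*4096 = 16.8M  ->  2 * 4096*512 = 4.2M (4x fewer),
# at the cost of the approximation error printed below.
print((W - A @ B).norm() / W.norm())
```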
System-level Optimizations:
Techniques like Paged Attention, Tensor/Pipeline Parallelism, CPU/GPU Offloading, and Fused Operations improve the runtime efficiency of LLMs (a toy paged KV-cache sketch follows this list).
Implementations such as vLLM, Llama.cpp, ExLlama, TensorRT-LLM, and MLC-LLM demonstrate the benefits of these system-level optimizations.
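To make the paged-attention idea concrete, below is a toy sketch of the block-table bookkeeping behind vLLM-style paged KV caches; real implementations do this on the GPU with custom kernels, and all names and sizes here are illustrative:

```python
import torch

BLOCK_SIZE = 16      # tokens per physical KV block (page)
NUM_BLOCKS = 64      # size of the shared physical block pool
NUM_HEADS, HEAD_DIM = 8, 64

# One shared pool of physical KV blocks; each sequence only holds indices into it,
# so memory is allocated in fixed-size pages rather than one contiguous cache per sequence.
k_pool = torch.zeros(NUM_BLOCKS, BLOCK_SIZE, NUM_HEADS, HEAD_DIM)
v_pool = torch.zeros_like(k_pool)
free_blocks = list(range(NUM_BLOCKS))

class SequenceCache:
    """Per-sequence block table mapping logical token positions to physical blocks."""
    def __init__(self):
        self.block_table = []   # physical block ids, in logical order
        self.length = 0         # number of tokens cached so far

    def append_kv(self, k: torch.Tensor, v: torch.Tensor):
        """Store one token's K/V vectors (each of shape num_heads x head_dim)."""
        offset = self.length % BLOCK_SIZE
        if offset == 0:                          # last block is full: grab a fresh page
            self.block_table.append(free_blocks.pop())
        block = self.block_table[-1]
        k_pool[block, offset] = k
        v_pool[block, offset] = v
        self.length += 1

seq = SequenceCache()
for _ in range(40):                              # 40 tokens -> ceil(40/16) = 3 pages
    seq.append_kv(torch.randn(NUM_HEADS, HEAD_DIM), torch.randn(NUM_HEADS, HEAD_DIM))
print(seq.block_table)                           # three page ids; unused pages stay in the shared pool
```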
Empirical Analysis:
The experiments on LLaMA(/2)-7B highlight the effectiveness of various compression techniques in terms of weight memory, runtime memory, inference speed, and perplexity (a perplexity-evaluation sketch follows this list).
FLaP and fine-tuned LLM-Pruner emerge as promising structured pruning methods, while quantization techniques like OmniQuant and GPTQ demonstrate strong performance.
System-level optimizations, particularly as implemented in TensorRT-LLM, deliver significant inference speedups.
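For readers who want to reproduce perplexity-style comparisons, here is a hedged sketch of a common WikiText-2 evaluation loop using Hugging Face transformers and datasets; the survey's exact protocol (context length, striding, tokenization details) may differ, and the checkpoint name is only an example of a gated LLaMA-2 model:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"   # example checkpoint; assumes access has been granted
CTX = 2048                           # tokens per evaluation chunk

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"  # device_map requires accelerate
)
model.eval()

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids

nlls = []
with torch.no_grad():
    for start in range(0, ids.size(1) - CTX, CTX):        # non-overlapping chunks
        chunk = ids[:, start:start + CTX].to(model.device)
        out = model(chunk, labels=chunk)                   # loss = mean token NLL on the chunk
        nlls.append(out.loss.float())

ppl = torch.exp(torch.stack(nlls).mean())
print(f"WikiText-2 perplexity: {ppl.item():.2f}")
```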
Challenges and Future Directions:
The survey identifies key challenges, such as the computational intensity of large-scale pruning/distillation, the overhead of on-the-fly quantization-dequantization, and the difficulty in rank selection for low-rank approximations.
Potential solutions include exploring training-free pruning methods, localized distillation, growing smaller models, and developing more efficient quantization techniques.
Overall, this survey provides a comprehensive overview of the state-of-the-art in LLM compression and optimization, offering practical insights and highlighting promising research directions to achieve efficient deployment of large language models.
Stats
A 70B-parameter LLaMA model requires roughly 140GB of VRAM for its weights alone, excluding the memory required during inference (see the back-of-envelope check after these stats).
LLM-Pruner at 50% sparsity achieves a perplexity of 112.44 on the WikiText-2 dataset.
FLaP at 50% sparsity achieves a perplexity of 31.80 on the WikiText-2 dataset.
GPTQ 4-bit quantization achieves a perplexity of 6.08 on the WikiText-2 dataset.
TensorRT-LLM with GPTQ 4-bit quantization achieves a token rate of 202.16 tokens/sec on an NVIDIA GPU.
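As a back-of-envelope check on such weight-memory figures (a rough sketch that ignores activations, the KV cache, and quantization metadata such as scales):

```python
GB = 1e9   # decimal gigabytes, as weight-memory figures are usually quoted

def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed just to store the weights at a given precision."""
    return num_params * bits_per_param / 8 / GB

for name, n in [("LLaMA-7B", 7e9), ("LLaMA-2-70B", 70e9)]:
    line = ", ".join(f"{bits}-bit ~{weight_memory_gb(n, bits):.1f} GB" for bits in (16, 8, 4))
    print(f"{name}: {line}")
# LLaMA-7B: 16-bit ~14.0 GB, 8-bit ~7.0 GB, 4-bit ~3.5 GB
# LLaMA-2-70B: 16-bit ~140.0 GB, 8-bit ~70.0 GB, 4-bit ~35.0 GB
```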
Quotes
"Despite their unparalleled performance, widespread adoption of LLMs is hindered by their substantial computational and memory requirements, which pose challenges for deployment in resource-constrained environments."
"To further push the frontiers of research towards practical inference improvement for LLMs, a comprehensive study is still missing."
"Drawing upon insights derived from our survey and empirical analysis, we systematically pinpoint existing limitations and propose viable pathways forward for achieving optimal efficiency in LLM inference."