This survey presents a thorough examination of low-bit quantization for large language models (LLMs). It begins by introducing the basics of quantization, including low-bit number formats, quantization granularity, and dynamic vs. static quantization strategies.
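As a rough illustration of these basics, the sketch below contrasts per-tensor and per-channel granularity using a symmetric round-to-nearest scheme; the function names and the 8-bit setting are illustrative choices, not taken from the survey. A dynamic scheme computes the scale from each incoming tensor as shown here, whereas a static scheme would fix the scale ahead of time from calibration data.

```python
import numpy as np

def quantize_per_tensor(x, num_bits=8):
    """Symmetric quantization with a single scale for the whole tensor."""
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 127 for int8
    scale = np.abs(x).max() / qmax            # dynamic: scale derived from this tensor
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def quantize_per_channel(x, num_bits=8):
    """Finer granularity: one scale per row (e.g. per output channel of a weight matrix)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

# A static scheme would instead freeze `scale` from a calibration set
# rather than recomputing it for every tensor at runtime.
W = np.random.randn(4, 8).astype(np.float32)
q_t, s_t = quantize_per_tensor(W)
q_c, s_c = quantize_per_channel(W)
print("per-tensor error :", np.abs(W - q_t * s_t).mean())
print("per-channel error:", np.abs(W - q_c * s_c).mean())
```

Per-channel scales typically yield lower reconstruction error than a single per-tensor scale, at the cost of storing and applying more scale factors.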
The paper then reviews the various inference frameworks and systems that support quantized LLMs across different hardware platforms, highlighting the algorithms, bitwidth support, target devices, and model families integrated into these frameworks.
Next, the authors delve into the algorithmic approaches for efficient training and inference of quantized LLMs. For training, they discuss methods for low-bit training and parameter-efficient fine-tuning. For inference, the survey covers quantization-aware training (QAT) and post-training quantization (PTQ), the latter organized into equivalent transformation, compensation, mixed precision, and combinations with other compression methods.
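As a toy illustration of one of these PTQ ideas, the sketch below applies a simplified equivalent transformation in the spirit of SmoothQuant-style scaling: activations are divided by a per-channel factor and the matching weight rows are multiplied by the same factor, so the matrix product is mathematically unchanged while activation outliers shrink before round-to-nearest quantization. The scaling rule, shapes, and values here are toy assumptions, not the survey's formulation.

```python
import numpy as np

def fake_quant(x, num_bits=8):
    """Round-to-nearest symmetric quantize-dequantize (per-tensor)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 64)).astype(np.float32)
X[:, :4] *= 20.0                         # a few outlier activation channels
W = rng.standard_normal((64, 64)).astype(np.float32)

# Toy per-channel migration factor (a simplified stand-in for the
# alpha-balanced rule used by equivalent-transformation methods).
s = np.abs(X).max(axis=0) ** 0.5

Y_ref   = X @ W
Y_naive = fake_quant(X) @ fake_quant(W)
Y_equiv = fake_quant(X / s) @ fake_quant(W * s[:, None])

print("naive PTQ error      :", np.abs(Y_ref - Y_naive).mean())
print("after transformation :", np.abs(Y_ref - Y_equiv).mean())
```

Because the transformation is exact in full precision, the scaled weights can be folded in offline, adding no inference-time overhead; compensation-style methods take a different route, adjusting remaining weights to offset the rounding error of already-quantized ones.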
The authors also summarize the key quantization toolkits and benchmarks that facilitate the development of accurate low-bit LLMs.
Finally, the survey explores future trends and potential advancements in LLM quantization, discussing emerging research directions and the impact of new technologies.
Overall, this comprehensive survey provides valuable insights and guidelines for researchers and developers seeking to enhance the efficiency and applicability of LLMs through low-bit quantization.
Source: Ruihao Gong et al., arXiv, 2024-09-26, https://arxiv.org/pdf/2409.16694.pdf