Core Concepts
QLLM is an accurate and efficient low-bitwidth post-training quantization method that addresses the challenge of activation outliers when quantizing large language models.
Abstract
The paper presents QLLM, an accurate and efficient low-bitwidth post-training quantization (PTQ) method designed for large language models (LLMs).
Key highlights:
- LLMs impose high computational and memory demands, hindering their broad deployment. Quantization is a promising remedy, but existing PTQ methods suffer significant performance degradation at low bitwidths because activation outliers stretch the quantization range.
- QLLM introduces an adaptive channel reassembly technique that reallocates the magnitude of outlier channels across other channels, so that no single channel dominates the quantization range (see the first sketch after this list).
- QLLM also proposes an efficient gradient-based error correction mechanism that learns a small set of low-rank weights to further compensate for the performance loss caused by quantization (see the second sketch after this list).
- Extensive experiments on LLaMA-1 and LLaMA-2 models show that QLLM obtains accurate quantized models efficiently. For example, QLLM quantizes LLaMA-2-70B to 4 bits within 10 hours and outperforms the previous state-of-the-art by 7.89% in average accuracy across five zero-shot tasks.
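To illustrate the channel reassembly idea, here is a minimal sketch of the disassembly step, assuming a PyTorch setting; `disassemble_channel`, its arguments, and the splitting factor `T` are hypothetical names for illustration, not the paper's implementation. The point is that splitting an outlier activation channel into T sub-channels, each carrying 1/T of its magnitude, and duplicating the matching weight column leaves the layer output unchanged while shrinking the activation range the quantizer must cover.

```python
import torch

def disassemble_channel(x, w, idx, T):
    """Split activation channel `idx` into T sub-channels of magnitude x_i / T
    and duplicate the matching weight column, so x_new @ w_new.T == x @ w.T
    while the per-channel activation range shrinks by a factor of T.
    x: (batch, in_features) activations; w: (out_features, in_features) weights."""
    x_sub = x[:, idx:idx + 1] / T                            # each sub-channel carries x_i / T
    x_new = torch.cat([x, x_sub.repeat(1, T - 1)], dim=1)    # append T-1 extra copies
    x_new[:, idx] = x_sub[:, 0]                              # shrink the original channel too
    w_col = w[:, idx:idx + 1]
    w_new = torch.cat([w, w_col.repeat(1, T - 1)], dim=1)    # reuse the same weight column
    return x_new, w_new

# Sanity check: the linear layer's output is preserved after disassembly.
x = torch.randn(4, 16); x[:, 3] *= 50.0                      # channel 3 is an artificial outlier
w = torch.randn(8, 16)
x_new, w_new = disassemble_channel(x, w, idx=3, T=4)
assert torch.allclose(x @ w.T, x_new @ w_new.T, atol=1e-4)
```

In the paper, a complementary assembly step merges similar channels so the original channel count is preserved, and the splitting factor is chosen adaptively per layer.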
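The second highlight, gradient-based error correction, can be sketched as a frozen quantized weight plus a small trainable low-rank term; `LowRankCorrectedLinear`, its rank, and the initialization below are illustrative assumptions, not the paper's code. Only the factors A and B are trained, by minimizing the block-wise reconstruction error against the full-precision output on calibration data.

```python
import torch
import torch.nn as nn

class LowRankCorrectedLinear(nn.Module):
    """Frozen quantized weight plus a trainable low-rank correction A @ B."""

    def __init__(self, w_quant, rank=4):
        super().__init__()
        out_features, in_features = w_quant.shape
        self.register_buffer("w_quant", w_quant)                      # frozen quantized weights
        self.A = nn.Parameter(torch.zeros(out_features, rank))        # zero-initialized so the
        self.B = nn.Parameter(torch.randn(rank, in_features) * 1e-3)  # correction starts near 0

    def forward(self, x):
        return x @ (self.w_quant + self.A @ self.B).T                 # only A and B get gradients

# Block-wise error correction on calibration data.
layer = LowRankCorrectedLinear(w_quant=torch.randn(8, 16))
optimizer = torch.optim.Adam([layer.A, layer.B], lr=1e-3)
x_calib = torch.randn(32, 16)
target = x_calib @ torch.randn(8, 16).T                               # stand-in for the FP block output
for _ in range(100):
    loss = ((layer(x_calib) - target) ** 2).mean()                    # block-wise reconstruction error
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```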
Stats
LLMs like GPT-3 and LLaMA contain billions of parameters; GPT-3, for example, has 175 billion, requiring at least 325GB of memory for storage in half-precision (FP16) format (a back-of-the-envelope check appears at the end of this section).
Existing PTQ methods suffer from significant performance degradation at low bitwidths due to activation outliers.
QLLM quantizes LLaMA-2-70B to 4 bits within 10 hours, outperforming the previous state-of-the-art by 7.89% in average accuracy across five zero-shot tasks.
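As a quick check on the 325GB figure, assuming GPT-3's 175 billion parameters at 2 bytes per FP16 value:

```python
# Back-of-the-envelope FP16 storage for GPT-3 (assumes 175B parameters, 2 bytes each).
params = 175e9
fp16_bytes = 2 * params
print(f"{fp16_bytes / 1e9:.0f} GB (decimal)")    # ~350 GB
print(f"{fp16_bytes / 2**30:.0f} GiB (binary)")  # ~326 GiB, consistent with the ~325GB figure
```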
Quotes
"Recent studies (Dettmers et al., 2022; Xiao et al., 2023; Wei et al., 2023) have revealed a unique pattern in LLMs' activations that is they contain specific outlier channels with significantly large magnitudes."
"To compensate for the performance drop of quantization, a widely adopted PTQ strategy (Wei et al., 2023; Shao et al., 2023; Yao et al., 2022) further proposes to tune the quantized LLM directly by minimizing the block-wise reconstruction error."