Core Concepts
The AQLM algorithm extends Additive Quantization (AQ) to enable extreme compression of large language models, achieving state-of-the-art accuracy at 2-3 bits per parameter.
Summary
The paper proposes the AQLM algorithm, which extends the classic Additive Quantization (AQ) approach to enable extreme compression of large language models (LLMs). The key innovations are:
- Adapting the AQ optimization problem to be instance-aware, taking the layer's input and output activations into account during quantization (a sketch of this objective follows the list).
- Complementing the layer-wise optimization with an efficient intra-block tuning technique, which jointly optimizes quantization parameters across several layers using the calibration data.
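As a rough illustration, the instance-aware objective minimizes the error on the layer's actual outputs rather than on the weights alone. The notation below (W for the weight matrix, X for calibration inputs, C_1 through C_M for the M codebooks, and b for the discrete code assignments) is chosen here for illustration and follows the general shape of calibration-based quantization objectives:

$$
\arg\min_{C_1,\dots,C_M,\;b}\;\bigl\lVert W X - \widehat{W}(C, b)\,X \bigr\rVert_F^2
$$

where $\widehat{W}(C, b)$ denotes the additive-quantized approximation of $W$, in which each group of weights is reconstructed as a sum of one codeword from each codebook.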
The authors evaluate AQLM by compressing accurate open LLMs from the Llama 2 family, achieving state-of-the-art results across the 2-4 bit compression range. In particular, AQLM significantly outperforms prior methods in the extreme 2-bit quantization regime while remaining practical: the authors provide fast GPU and CPU implementations that can match or outperform optimized FP16 inference speed.
The paper first provides background on LLM quantization and multi-codebook quantization techniques like Additive Quantization. It then details the AQLM algorithm, which has three phases (a simplified code sketch follows the list):
- Beam search to optimize the discrete codes
- Codebook update via gradient descent
- Fine-tuning of the codebooks and non-linear layers across transformer blocks
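The following is a minimal, self-contained PyTorch sketch of the first two phases under simplifying assumptions: a single codebook, and greedy nearest-codeword assignment standing in for the paper's beam search over additive codes. All function and variable names are invented for this example, and phase 3 (block-level fine-tuning) is omitted.

```python
import torch

def quantize_layer(W, X, num_codes=256, group=8, steps=100):
    """W: (out_dim, in_dim) weights; X: (in_dim, n_samples) calibration inputs."""
    out_dim, in_dim = W.shape
    flat = W.reshape(out_dim * in_dim // group, group)  # split weights into groups
    # Initialize the codebook from randomly chosen weight groups.
    codebook = flat[torch.randperm(flat.shape[0])[:num_codes]].clone()
    codebook.requires_grad_(True)
    opt = torch.optim.Adam([codebook], lr=1e-3)
    for _ in range(steps):
        # Phase 1 (code search): assign each group its nearest codeword;
        # the paper instead runs beam search over sums of codewords.
        with torch.no_grad():
            codes = torch.cdist(flat, codebook).argmin(dim=1)
        # Phase 2 (codebook update): gradient descent on the instance-aware
        # reconstruction error ||W X - W_hat X||^2 with the codes held fixed.
        W_hat = codebook[codes].reshape(out_dim, in_dim)
        loss = ((W @ X - W_hat @ X) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return codes, codebook.detach()
```

In the full method, the two phases alternate as above, and a final fine-tuning pass jointly adjusts the codebooks and the non-linear layers within each transformer block on the calibration data.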
The experimental evaluation shows that AQLM is the first algorithm to achieve Pareto-optimality at less than 3 bits per parameter, outperforming prior methods across a range of metrics including perplexity and zero-shot task accuracy. The authors also provide extensive ablations and analysis of the algorithm's components and hyperparameters.
Statistics
The paper reports the following key metrics:
- Perplexity on the WikiText-2 and C4 validation sets (a short evaluation sketch follows this list)
- Accuracy on zero-shot tasks, including WinoGrande, PiQA, HellaSwag, ARC-easy, and ARC-challenge
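For reference, perplexity is the exponentiated mean next-token negative log-likelihood. A minimal sketch, assuming a causal LM with a HuggingFace-style interface, where passing labels=input_ids makes the model return that mean loss; the function name is a placeholder:

```python
import math
import torch

def perplexity(model, input_ids):
    """Compute perplexity of a causal LM over one tokenized sequence (1, seq_len)."""
    with torch.no_grad():
        # With labels == input_ids, HuggingFace causal LMs shift the labels
        # internally and return the mean next-token negative log-likelihood.
        loss = model(input_ids=input_ids, labels=input_ids).loss
    return math.exp(loss.item())
```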
Quotes
"AQLM is the first scheme that is Pareto optimal in terms of accuracy-vs-model-size when compressing to less than 3 bits per parameter, and significantly improves upon all known schemes in the extreme compression (2bit) regime."
"AQLM can match or even outperform the floating point baseline in terms of speed, while reducing the memory footprint by up to 8x."