The paper proposes the AQLM algorithm, which extends the classic Additive Quantization (AQ) approach to enable extreme compression of large language models (LLMs). The key innovations are (1) learned additive quantization of weight matrices in an input-adaptive fashion, calibrated on layer input activations, and (2) joint optimization of codebook parameters across each transformer block.
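For orientation, additive quantization approximates each group of g weights as the sum of one codeword drawn from each of M learned codebooks, so M codebooks of K codewords each cost only M·log2(K) index bits per group. In our notation (chosen for illustration, not necessarily the paper's):

$$\hat{w} = \sum_{m=1}^{M} C_m[b_m], \qquad b_m \in \{1, \dots, K\}, \quad C_m \in \mathbb{R}^{K \times g}.$$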
The authors evaluate AQLM by compressing accurate open LLMs from the LLAMA 2 family, achieving state-of-the-art results across the 2-4 bit compression range. In particular, AQLM significantly outperforms prior methods in the extreme 2-bit quantization regime while remaining practical: the authors provide fast GPU and CPU implementations that can match or outperform optimized FP16 inference speed.
The paper first provides background on LLM quantization and multi-codebook quantization techniques such as Additive Quantization. It then details the AQLM algorithm, which has three phases: (1) a beam search that updates the discrete codes while the codebooks are held fixed, (2) a gradient-based update of the codebook parameters while the codes are held fixed, and (3) block-wise fine-tuning that jointly adjusts the quantized layers of a transformer block to compensate for accumulated error. A toy sketch of the alternation between the first two phases follows.
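The sketch below is our own simplification, not the paper's reference implementation: greedy per-codebook code selection stands in for beam search, and a plain Frobenius-norm objective stands in for the paper's calibration loss over layer inputs; all names and shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def reconstruct(codebooks, codes):
    # Each weight group is the sum of one codeword per codebook.
    return sum(codebooks[m][codes[:, m]] for m in range(codebooks.shape[0]))

def aq_layer_sketch(W, M=2, K=256, iters=5):
    """Toy alternating optimization in the spirit of AQLM's first two phases.
    Simplifications: greedy per-codebook code updates instead of beam search,
    and a Frobenius objective ||W - W_hat||^2 instead of the calibration
    loss over layer input activations."""
    n, g = W.shape                                   # n groups of g weights
    codebooks = rng.standard_normal((M, K, g)).astype(W.dtype) * W.std()
    codes = rng.integers(0, K, size=(n, M))
    for _ in range(iters):
        # Phase 1: re-select codes with codebooks held fixed.
        for m in range(M):
            target = W - reconstruct(codebooks, codes) + codebooks[m][codes[:, m]]
            dists = ((target[:, None, :] - codebooks[m][None]) ** 2).sum(-1)
            codes[:, m] = dists.argmin(axis=1)       # best codeword per group
        # Phase 2: refit codewords with codes held fixed (k-means-style update).
        for m in range(M):
            target = W - reconstruct(codebooks, codes) + codebooks[m][codes[:, m]]
            for k in range(K):
                used = codes[:, m] == k
                if used.any():
                    codebooks[m, k] = target[used].mean(axis=0)
    return codebooks, codes

W = rng.standard_normal((1024, 8)).astype(np.float32)
cb, cd = aq_layer_sketch(W)
rel_err = np.linalg.norm(W - reconstruct(cb, cd)) / np.linalg.norm(W)
print(f"relative reconstruction error: {rel_err:.3f}")
```

Even this stripped-down alternation steadily reduces the toy objective; the paper's beam search and calibration-aware loss refine the same basic loop.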
The experimental evaluation shows that AQLM is the first algorithm to achieve Pareto-optimality at less than 3 bits per parameter, outperforming prior methods across a range of metrics including perplexity and zero-shot task accuracy. The authors also provide extensive ablations and analysis of the algorithm's components and hyperparameters.