The paper proposes the AQLM algorithm, which extends the classic Additive Quantization (AQ) approach to enable extreme compression of large language models (LLMs). The key innovations are (1) learned additive quantization of the weight matrices in an input-adaptive fashion, and (2) joint optimization of the codebook parameters across the layers of each transformer block.
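To illustrate the underlying representation (a minimal sketch, not the paper's implementation): in additive quantization, each group of g consecutive weights is approximated as the sum of one codeword from each of M learned codebooks, so only the M integer indices need to be stored per group. The shapes and names below are assumptions for the example.

```python
import numpy as np

def reconstruct_group(codes, codebooks):
    """Reconstruct one weight group as the sum of its selected codewords.

    codes:     (M,) integer indices, one per codebook
    codebooks: (M, K, g) learned codebook vectors
    Returns a (g,) approximation of the original weight group.
    """
    M, K, g = codebooks.shape
    return sum(codebooks[m, codes[m]] for m in range(M))

# Toy example: 2 codebooks of 256 vectors each, groups of 8 weights
rng = np.random.default_rng(0)
codebooks = rng.normal(size=(2, 256, 8)).astype(np.float32)
codes = np.array([3, 117])
group = reconstruct_group(codes, codebooks)  # shape (8,)
```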
The authors evaluate AQLM on compressing accurate open LLMs from the LLAMA 2 family, achieving state-of-the-art results across the 2-4 bit compression range. AQLM significantly outperforms prior methods in the extreme 2-bit quantization regime while remaining practical: the authors provide fast GPU and CPU implementations whose inference speed matches or exceeds optimized FP16 baselines.
The paper first provides background on LLM quantization and on multi-codebook quantization techniques such as Additive Quantization. It then details the AQLM algorithm, which proceeds in three phases: (1) beam search over the discrete codes for each weight group, (2) gradient-based updates of the continuous codebooks, and (3) fine-tuning of codebooks and remaining parameters jointly across each transformer block (a simplified sketch follows below).
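As a rough sketch of the inner loop, heavily simplified relative to the paper (plain weight MSE instead of the calibration-input-weighted objective, greedy per-codebook code updates instead of beam search, and the block-level fine-tuning phase omitted), the alternation between discrete code search and continuous codebook updates might look like:

```python
import torch

def reconstruct(codes, codebooks):
    """Sum of the selected codeword from each codebook.
    codes: (groups, M) long; codebooks: (M, K, g) -> result (groups, g)."""
    M = codebooks.shape[0]
    parts = [codebooks[m][codes[:, m]] for m in range(M)]
    return torch.stack(parts, dim=0).sum(dim=0)

def quantize_layer(w_groups, steps=20, M=2, K=256):
    """Alternate discrete code search and continuous codebook updates
    (a stand-in for the paper's beam-search + Adam phases)."""
    groups, g = w_groups.shape
    codebooks = torch.randn(M, K, g, requires_grad=True)
    codes = torch.randint(0, K, (groups, M))
    opt = torch.optim.Adam([codebooks], lr=1e-2)
    for _ in range(steps):
        # Phase 1: greedy code update, one codebook at a time
        # (the paper searches over codebooks jointly via beam search)
        with torch.no_grad():
            for m in range(M):
                residual = (w_groups - reconstruct(codes, codebooks)
                            + codebooks[m][codes[:, m]])
                # pick the codeword closest to each group's residual
                codes[:, m] = torch.cdist(residual, codebooks[m]).argmin(dim=1)
        # Phase 2: gradient step on the codebooks (codes held fixed)
        opt.zero_grad()
        loss = (w_groups - reconstruct(codes, codebooks)).pow(2).mean()
        loss.backward()
        opt.step()
    return codes, codebooks.detach()

# Toy usage: quantize a 64x64 weight matrix in groups of 8
W = torch.randn(64, 64)
codes, cbs = quantize_layer(W.reshape(-1, 8))
W_hat = reconstruct(codes, cbs).reshape(64, 64)
```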
The experimental evaluation shows that AQLM is the first algorithm to achieve Pareto-optimality at less than 3 bits per parameter, outperforming prior methods across a range of metrics including perplexity and zero-shot task accuracy. The authors also provide extensive ablations and analysis of the algorithm's components and hyperparameters.
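To make the bit-width accounting concrete (illustrative numbers, not a claim about the paper's exact configurations): if M codebooks of 2^B entries each encode groups of g consecutive weights, the stored codes cost M*B/g bits per weight, plus a small amortized overhead for the codebooks themselves.

```python
def bits_per_weight(num_codebooks, code_bits, group_size):
    """Amortized cost of the stored codes alone (codebook storage excluded)."""
    return num_codebooks * code_bits / group_size

# One 2**16-entry codebook over groups of 8 weights lands in the ~2-bit regime;
# two such codebooks land in the ~4-bit regime.
print(bits_per_weight(num_codebooks=1, code_bits=16, group_size=8))  # 2.0
print(bits_per_weight(num_codebooks=2, code_bits=16, group_size=8))  # 4.0
```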
Source: https://arxiv.org/pdf/2401.06118.pdf