Core Concepts
CBQ is a post-training quantization method for large language models built on cross-block reconstruction; it achieves superior low-bit quantization and outperforms existing state-of-the-art methods.
Abstract
The paper introduces CBQ, a post-training quantization (PTQ) method for large language models. It addresses the limitations of existing block-wise PTQ methods, which calibrate each block in isolation, by combining cross-block reconstruction, outlier suppression, and adaptive weight rounding. Extensive experiments show superior performance in low-bit settings across a range of models and datasets.
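Of the three ingredients, outlier suppression is the easiest to illustrate in isolation. Below is a generic sketch of clipping extreme values at a percentile threshold before quantization ranges are computed; it shows the common idea only, not necessarily CBQ's exact pre-processing, and the function name and threshold are illustrative assumptions.

```python
import torch

def clip_outliers(x: torch.Tensor, pct: float = 0.999) -> torch.Tensor:
    """Truncate extreme values at a symmetric percentile threshold so a
    few outliers do not stretch the quantization range.
    Generic illustration only, not necessarily CBQ's exact scheme."""
    thresh = float(torch.quantile(x.abs().flatten(), pct))
    return x.clamp(-thresh, thresh)

x = torch.randn(1024)
x[0] = 100.0  # inject a single extreme outlier
print(x.abs().max().item(), clip_outliers(x).abs().max().item())
```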
Introduction
Large language models have attracted wide interest thanks to their strong performance on natural language tasks, but their size makes them costly to deploy. Model compression techniques such as post-training quantization are therefore essential for practical deployment.
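As background, post-training quantization maps trained weights (and optionally activations) onto a small set of integer levels without retraining. Here is a minimal sketch of uniform asymmetric fake-quantization; the per-tensor granularity and function name are illustrative simplifications (PTQ methods typically operate per-channel or per-group).

```python
import torch

def quantize_uniform(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Uniform asymmetric quantization: map floats onto 2**bits integer
    levels via a scale and zero-point, then dequantize ("fake-quant")."""
    qmin, qmax = 0, 2 ** bits - 1
    scale = (w.max() - w.min()).clamp(min=1e-8) / (qmax - qmin)
    zero_point = torch.round(-w.min() / scale)
    q = torch.clamp(torch.round(w / scale + zero_point), qmin, qmax)
    return (q - zero_point) * scale

w = torch.randn(4, 4)
print((w - quantize_uniform(w, bits=4)).abs().max())  # rounding error
```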
Data Extraction
"CBQ quantizes the 4-bit LLAMA1-65B model within only 4.3 hours on a single GPU."
"CBQ achieves superior low-bit quantization (W4A4, W4A8, W2A16) and outperforms existing state-of-the-art methods."
Quotations
"CBQ introduces a cross-block dependency using a homologous reconstruction scheme."
"CBQ achieves superior low-bit quantization and outperforms existing state-of-the-art methods."