The LLaMA3-70B model series exhibits a unique vulnerability to per-channel 8-bit quantization, in contrast to other large language models that demonstrate robust performance under the same quantization scheme.
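To make the scheme concrete, here is a minimal NumPy sketch of symmetric per-channel 8-bit weight quantization, i.e. one scale per output channel; the function names and the symmetric round-to-nearest scheme are illustrative assumptions rather than the exact configuration used in that study.

```python
import numpy as np

def quantize_per_channel_int8(W: np.ndarray):
    """Symmetric per-channel INT8 quantization: one scale per output channel (row)."""
    scales = np.abs(W).max(axis=1, keepdims=True) / 127.0   # per-row scale
    scales = np.where(scales == 0, 1e-8, scales)            # guard against all-zero rows
    W_q = np.clip(np.round(W / scales), -127, 127).astype(np.int8)
    return W_q, scales

def dequantize(W_q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return W_q.astype(np.float32) * scales

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8)).astype(np.float32)
W_q, s = quantize_per_channel_int8(W)
print("max abs reconstruction error:", np.abs(W - dequantize(W_q, s)).max())
```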
This survey provides a comprehensive overview of low-bit quantization methods for large language models, covering fundamental principles, system implementations, and algorithmic approaches that improve the efficiency and deployability of LLMs.
The proposed DuQuant method effectively mitigates the impact of both massive and normal outlier activations in large language models through strategic rotation and permutation transformations, leading to substantial performance improvements in low-bit quantization scenarios.
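The key mechanism is that orthogonal transforms can be folded into adjacent weights without changing the layer output, while redistributing outlier energy across channels. The NumPy sketch below illustrates this invariance with a random block-diagonal rotation and a random channel permutation; the actual DuQuant transforms are built from calibration statistics (outlier-aware rotations alternated with permutations), so everything here is a simplified assumption.

```python
import numpy as np

def block_rotation(dim: int, block: int, rng) -> np.ndarray:
    """Block-diagonal orthogonal matrix: an independent random rotation per block."""
    R = np.zeros((dim, dim))
    for i in range(0, dim, block):
        Q, _ = np.linalg.qr(rng.standard_normal((block, block)))
        R[i:i + block, i:i + block] = Q
    return R

rng = np.random.default_rng(0)
d, block = 16, 4
x = rng.standard_normal((2, d))
x[:, 3] *= 50.0                              # inject a "massive" outlier channel
W = rng.standard_normal((d, d))

P = np.eye(d)[rng.permutation(d)]            # channel permutation (orthogonal)
R = block_rotation(d, block, rng)            # block-diagonal rotation (orthogonal)
T = P @ R                                    # combined transform

x_t = x @ T                                  # transformed activations
W_t = T.T @ W                                # inverse transform folded into the weights

print("output unchanged:", np.allclose(x @ W, x_t @ W_t))
# The rotation mixes the outlier channel with its block, typically shrinking the peak magnitude.
print("max |activation| before vs. after:", np.abs(x).max(), np.abs(x_t).max())
```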
APTQ (Attention-aware Post-Training Mixed-Precision Quantization) is a novel technique that accounts for the nonlinear effect of attention outputs and leverages second-order Hessian information to achieve high-quality quantization of large language models, enabling efficient deployment on edge devices.
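APTQ itself combines this sensitivity signal with attention-aware calibration and weight updates; the sketch below only illustrates the general mixed-precision idea of spending more bits on Hessian-sensitive layers, using a hypothetical allocate_bits helper and made-up sensitivity values.

```python
import numpy as np

def allocate_bits(sensitivities, bit_choices=(2, 4), avg_budget=3.0):
    """Greedy mixed-precision allocation: the most Hessian-sensitive layers get the
    higher bit-width until the average bit budget is exhausted."""
    n = len(sensitivities)
    bits = np.full(n, min(bit_choices), dtype=int)
    order = np.argsort(sensitivities)[::-1]      # most sensitive layers first
    for idx in order:
        trial = bits.copy()
        trial[idx] = max(bit_choices)
        if trial.mean() <= avg_budget:           # accept only if budget still holds
            bits = trial
    return bits

# Hypothetical per-layer Hessian-trace sensitivities (e.g. estimated on a small
# calibration set); the values are made up purely for illustration.
sens = np.array([0.9, 0.1, 0.5, 0.05, 0.7, 0.3])
print(allocate_bits(sens))   # -> [4 2 4 2 4 2] under a 3-bit average budget
```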
Kashin Quantization, a novel data quantization approach, can efficiently compress large language models while maintaining competitive or superior predictive performance.
QuaRot is a new quantization scheme that rotates large language models to remove outliers from the hidden state, enabling end-to-end 4-bit quantization of weights, activations, and the KV cache without any high-precision channels.
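A core ingredient is computational invariance: multiplying the hidden state by an orthogonal (e.g. Hadamard) matrix and folding its inverse into the neighbouring weights leaves the output unchanged while spreading outlier energy across channels, which makes 4-bit quantization far less damaging. The NumPy sketch below shows only this invariance step, not QuaRot's full pipeline (additional online transforms and the actual 4-bit quantizers).

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of an orthonormal Hadamard matrix (n must be a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

d = 16
rng = np.random.default_rng(0)
x = rng.standard_normal((1, d))
x[0, 5] = 100.0                      # an outlier channel in the hidden state
W = rng.standard_normal((d, d))

Q = hadamard(d)                      # orthogonal rotation
x_rot = x @ Q                        # rotated hidden state: outlier energy is spread out
W_rot = Q.T @ W                      # rotation folded into the next layer's weights

print("output preserved:", np.allclose(x @ W, x_rot @ W_rot))
print("max |activation| before vs. after:", np.abs(x).max(), np.abs(x_rot).max())
```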