The author introduces Low-Rank Quantization Error Reduction (LQER), a post-training quantization method for Large Language Models (LLMs) that combines quantization with a low-rank approximation of the quantization error. LQER leverages an activation-induced scale matrix to drive the singular value distribution of the quantization error towards a distribution that a low-rank term can capture, enabling nearly-lossless W4A8 quantization across various LLMs and downstream tasks without knowledge distillation or iterative optimization. By shrinking the memory and compute footprint of inference while preserving accuracy, LQER makes large language models more accessible.
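To make the mechanism concrete, here is a minimal NumPy sketch of the core idea under stated assumptions: a simulated symmetric per-tensor 4-bit quantizer, a diagonal scale S built from per-channel activation magnitudes on calibration data, and illustrative names such as `quantize_w4` and `lqer_correction` that are not from the paper. The activation-scaled error E·S is factored by truncated SVD, and the resulting low-rank term is added back to the quantized weight.

```python
import numpy as np

def quantize_w4(w: np.ndarray) -> np.ndarray:
    """Simulated symmetric per-tensor 4-bit quantization, returned dequantized."""
    step = np.abs(w).max() / 7.0              # int4 symmetric levels in [-8, 7]
    return np.clip(np.round(w / step), -8, 7) * step

def lqer_correction(w: np.ndarray, act_scale: np.ndarray, rank: int):
    """Low-rank reconstruction of the activation-scaled quantization error.

    w         : (d_out, d_in) weight matrix
    act_scale : (d_in,) per-input-channel activation magnitudes (diagonal of S)
    rank      : rank k of the correction term
    Returns (w_q, a_k, b_k) such that w ~= w_q + a_k @ b_k.
    """
    w_q = quantize_w4(w)
    err = w - w_q                              # quantization error E
    # SVD of the scaled error E @ S; the scale concentrates the singular values
    u, sv, vt = np.linalg.svd(err * act_scale, full_matrices=False)
    a_k = u[:, :rank] * sv[:rank]              # U_k Sigma_k      (d_out, k)
    b_k = vt[:rank] / act_scale                # V_k^T S^{-1}     (k, d_in)
    return w_q, a_k, b_k

# Toy usage: the low-rank term shrinks the output error on calibration data.
rng = np.random.default_rng(0)
w = rng.standard_normal((256, 512))
x = rng.standard_normal((64, 512))            # calibration activations
act_scale = np.abs(x).mean(axis=0) + 1e-6     # avoid division by zero
w_q, a_k, b_k = lqer_correction(w, act_scale, rank=32)
print(np.linalg.norm(x @ (w_q - w).T))              # plain quantization error
print(np.linalg.norm(x @ (w_q + a_k @ b_k - w).T))  # with low-rank correction
```

Scaling the error by activation magnitudes before the SVD biases the low-rank budget toward the input channels that matter most at inference time, which is why the activation-induced scale matrix is central to the method.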