Core Concepts
LQER combines quantization and low-rank approximation to recover model capability efficiently.
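To make the combination concrete, here is a minimal NumPy sketch of the idea: quantize the weight matrix, then approximate the resulting quantization error with a rank-k factorization obtained from a truncated SVD. The names `fake_quantize` and `lqer_decompose`, the symmetric per-tensor round-to-nearest quantizer, and the random test matrix are assumptions made for illustration only; they are not taken from the paper's implementation.

```python
import numpy as np

def fake_quantize(w, bits=4):
    """Symmetric per-tensor round-to-nearest quantization (illustrative stand-in)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def lqer_decompose(w, rank, bits=4):
    """Return (w_q, a_k, b_k) such that w is approximated by w_q + a_k @ b_k."""
    w_q = fake_quantize(w, bits)
    e_q = w - w_q                                   # quantization error
    u, s, vt = np.linalg.svd(e_q, full_matrices=False)
    a_k = u[:, :rank]                               # shape (in, rank)
    b_k = s[:rank, None] * vt[:rank]                # shape (rank, out)
    return w_q, a_k, b_k

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 48))                       # hypothetical weight matrix
w_q, a_k, b_k = lqer_decompose(w, rank=8)

err_plain = np.linalg.norm(w - w_q)                 # error of quantization alone
err_lqer = np.linalg.norm(w - (w_q + a_k @ b_k))    # error after low-rank correction
print(err_plain, err_lqer)
```

The low-rank term can only reduce the Frobenius-norm error, since the truncated SVD is the best rank-k approximation of the error matrix. How much it helps depends on how quickly the singular values of the error decay.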
Abstract
Introduces LQER for post-training quantization of Large Language Models (LLMs).
LQER leverages an activation-induced scale matrix for nearly lossless W4A8 quantization.
Achieves near-lossless performance on six downstream tasks while using 1.36× fewer hardware resources than the leading state-of-the-art method.
Proposes L2QER for efficient quantization without iterative optimization.
Discusses the computation pattern and benefits of LQER and L2QER.
Compares LQER and L2QER with existing quantization methods.
Evaluates the performance of L2QER on different model families.
Highlights the efficiency and effectiveness of LQER and L2QER in reducing quantization error.
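The activation-induced scaling and the resulting computation pattern can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes a diagonal scale built from the mean absolute activation of each input channel over a small calibration batch, a simple per-tensor quantizer, and the convention y = x W with W of shape (in, out). The function names `fake_quantize`, `l2qer_decompose`, and `forward` are hypothetical.

```python
import numpy as np

def fake_quantize(w, bits=4):
    """Symmetric per-tensor round-to-nearest quantization (illustrative stand-in)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def l2qer_decompose(w, x_calib, rank, bits=4):
    """Scaled low-rank reconstruction of the quantization error, in one shot."""
    w_q = fake_quantize(w, bits)
    e_q = w - w_q
    # Assumed scale: mean |activation| per input channel (diagonal of S).
    s = np.abs(x_calib).mean(axis=0)
    # SVD of the scaled error S @ e_q; no iterative optimization involved.
    u, sig, vt = np.linalg.svd(s[:, None] * e_q, full_matrices=False)
    a_k = u[:, :rank] / s[:, None]                  # S^{-1} U_k
    b_k = sig[:rank, None] * vt[:rank]              # Sigma_k V_k^T
    return w_q, a_k, b_k, s

def forward(x, w_q, a_k, b_k):
    """Computation pattern: low-precision matmul plus a thin low-rank branch."""
    return x @ w_q + (x @ a_k) @ b_k

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 48))                       # hypothetical weights
channel_gain = np.where(np.arange(64) % 16 == 0, 20.0, 1.0)  # a few outlier channels
x = rng.normal(size=(256, 64)) * channel_gain       # hypothetical calibration batch

w_q, a_k, b_k, s = l2qer_decompose(w, x, rank=8)
y = forward(x, w_q, a_k, b_k)

scaled_err_plain = np.linalg.norm(s[:, None] * (w - w_q))
scaled_err_l2qer = np.linalg.norm(s[:, None] * (w - (w_q + a_k @ b_k)))
print(scaled_err_plain, scaled_err_l2qer)
```

Scaling the error before the SVD weights the reconstruction towards input channels with large activations, so the rank budget is spent where errors affect the output most. The two matrix products in the low-rank branch are small when the rank is much smaller than the layer width.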
Stats
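LQER combines quantization and low-rank approximation to recover model capability efficiently.
LQER uses an activation-induced scale matrix to achieve nearly lossless W4A8 quantization.
LQER achieves near-lossless performance on downstream tasks while using fewer hardware resources.
L2QER is proposed to enable efficient quantization without iterative optimization.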
Quotes
"LQER leverages an activation-induced scale matrix to drive the singular value distribution of quantization error towards a desirable distribution."
"Our W4A8 LLMs achieve near-lossless performance on six popular downstream tasks, while using 1.36× fewer hardware resources than the leading state-of-the-art method."