The paper proposes Activation-aware Weight Quantization (AWQ), a novel method for low-bit weight-only quantization of large language models (LLMs). The key insight is that weights are not equally important: protecting only 1% of the most salient weights can greatly reduce quantization error. AWQ identifies the salient weight channels by observing the activation distribution rather than the weight distribution, and then applies per-channel scaling to protect them and reduce quantization error, without relying on any backpropagation or reconstruction.
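To make the scaling idea concrete, below is a minimal sketch (not the authors' released implementation) of how an activation-aware scale can be searched for one linear layer. The helper names `pseudo_quantize` and `awq_scale_search`, the grid size, and the group size are illustrative assumptions.

```python
import torch

def pseudo_quantize(w, n_bits=4, group_size=128):
    # Simplified round-to-nearest grouped quantizer (stand-in for the paper's
    # grouped INT4 quantization). Assumes in_features is a multiple of group_size.
    orig_shape = w.shape
    w = w.reshape(-1, group_size)
    w_max, w_min = w.amax(dim=1, keepdim=True), w.amin(dim=1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-5) / (2 ** n_bits - 1)
    zero = (-w_min / scale).round()
    w_q = (torch.clamp(torch.round(w / scale) + zero, 0, 2 ** n_bits - 1) - zero) * scale
    return w_q.reshape(orig_shape)

def awq_scale_search(w, x, n_grid=20):
    # w: [out_features, in_features] linear-layer weight
    # x: [n_tokens, in_features] calibration activations
    act_mean = x.abs().mean(dim=0)  # per-input-channel activation magnitude
    best_err, best_scales = float("inf"), None
    for i in range(n_grid):
        alpha = i / n_grid
        scales = act_mean.clamp(min=1e-5) ** alpha
        scales = scales / (scales.max() * scales.min()).sqrt()  # normalize
        # Scale salient input channels up before quantization and undo it after:
        # exact in full precision, but it shrinks the relative quantization
        # error on the channels with large activations.
        w_q = pseudo_quantize(w * scales) / scales
        err = ((x @ w_q.T - x @ w.T) ** 2).mean().item()
        if err < best_err:
            best_err, best_scales = err, scales
    return best_scales
```

In the full method, the chosen scales are folded into the preceding operator where possible, so no extra multiplication is needed at inference time.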
To implement AWQ, the authors developed TinyChat, an efficient inference framework that translates the memory savings of 4-bit quantization into measured speedups of over 3x compared to FP16 on various LLMs across desktop, laptop, and mobile GPUs. TinyChat employs techniques such as on-the-fly weight dequantization, SIMD-aware weight packing, and kernel fusion to minimize inference overhead.
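As an illustration of the weight-packing and on-the-fly dequantization idea, here is a simplified NumPy sketch, not TinyChat's actual CUDA/SIMD kernels; the sequential bit layout and helper names are assumptions.

```python
import numpy as np

def pack_int4(q):
    # q: array of 4-bit values (0..15); pack eight of them into each uint32.
    # Real SIMD-aware packing uses a platform-specific interleaved order so a
    # thread can unpack neighboring weights with a single shift+mask; plain
    # sequential order is used here for clarity.
    assert q.size % 8 == 0 and q.max() < 16
    q = q.astype(np.uint32).reshape(-1, 8)
    packed = np.zeros(q.shape[0], dtype=np.uint32)
    for i in range(8):
        packed |= q[:, i] << np.uint32(4 * i)
    return packed

def unpack_and_dequantize(packed, scale, zero):
    # On-the-fly dequantization: recover the 4-bit integers and map them back
    # to floating point with a (per-group) scale and zero-point just before use.
    vals = np.stack([(packed >> np.uint32(4 * i)) & np.uint32(0xF)
                     for i in range(8)], axis=1)
    return (vals.astype(np.float32) - zero) * scale
```

In the real kernels this unpacking is fused with the matrix multiplication, so the dequantized FP16 weights never round-trip through GPU memory.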
Experiments show that AWQ outperforms existing quantization methods across various language modeling and domain-specific benchmarks, including for instruction-tuned and multi-modal LMs. Thanks to its better generalization, AWQ also enables deployment of the large 70B Llama-2 model on mobile GPUs.
Key insights extracted from Ji Lin et al., arxiv.org, 04-23-2024: https://arxiv.org/pdf/2306.00978.pdf