Activation-aware Weight Quantization (AWQ) is a hardware-friendly approach to low-bit, weight-only quantization of large language models (LLMs). By identifying the small fraction of salient weight channels from the activation distribution and protecting them with per-channel scaling, it substantially reduces quantization error without relying on backpropagation or reconstruction.
Quantization of this kind is crucial for efficient deployment: it shrinks the memory footprint of LLMs while preserving accuracy, making them practical to run across diverse hardware platforms.
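
To make the mechanism concrete, below is a minimal PyTorch sketch of the core idea, not the official implementation: input channels that see large activations are scaled up before rounding, which shrinks their relative quantization error, and the inverse scale is folded back so the layer computes (almost) the same function. The function names, the fixed `alpha`, the group size, and the tensor shapes are illustrative assumptions; the actual method searches the scaling exponent per layer on calibration data.

```python
import torch

def pseudo_quantize(w: torch.Tensor, n_bits: int = 4, group_size: int = 128) -> torch.Tensor:
    """Asymmetric min-max quantize-dequantize, applied per group of weights."""
    orig_shape = w.shape
    w = w.reshape(-1, group_size)  # assumes in_features % group_size == 0
    w_min = w.amin(dim=1, keepdim=True)
    w_max = w.amax(dim=1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-5) / (2 ** n_bits - 1)
    zero = (-w_min / scale).round()
    q = torch.clamp((w / scale).round() + zero, 0, 2 ** n_bits - 1)
    return ((q - zero) * scale).reshape(orig_shape)

def awq_scale_and_quantize(w: torch.Tensor, act_mean: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Scale salient input channels up before quantization, then fold the
    inverse scale back so the layer's output is (nearly) unchanged."""
    s = act_mean.clamp(min=1e-5).pow(alpha)  # per-input-channel scale
    return pseudo_quantize(w * s) / s        # broadcasts over rows of (out, in)

# Usage sketch with hypothetical shapes: derive per-channel salience
# from calibration activations, then quantize a linear layer's weight.
x = torch.randn(512, 4096)            # calibration activations
w = torch.randn(11008, 4096)          # linear-layer weight, (out, in)
act_mean = x.abs().mean(dim=0)        # average magnitude per input channel
w_q = awq_scale_and_quantize(w, act_mean, alpha=0.5)
mse = (x @ w.T - x @ w_q.T).pow(2).mean()
```

Because the whole procedure is a scaling pass followed by rounding, it needs no gradients and no layer-by-layer reconstruction, which is what keeps the approach cheap and hardware-friendly.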