Activation-aware Weight Quantization (AWQ) is a hardware-friendly approach to low-bit weight-only quantization of large language models (LLMs): it identifies the small fraction of salient weight channels from activation statistics and protects them with per-channel scaling, reducing quantization error without relying on backpropagation or reconstruction.
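As a rough illustration of the idea (not the reference implementation), the sketch below scales weight channels by a power of their calibration activation magnitude before round-to-nearest quantization; the helper name `awq_style_quantize`, the exponent `alpha`, and the toy statistics are assumptions made for this example.

```python
import torch

def awq_style_quantize(w, act_scale, n_bits=4, alpha=0.5):
    """Minimal AWQ-style weight-only quantization sketch (hypothetical helper).

    w:         [out_features, in_features] weight matrix
    act_scale: [in_features] mean |activation| per input channel (from calibration)
    alpha:     exponent for the per-channel scale (AWQ searches this on calibration data)
    """
    # Input channels with large activations get scaled up before quantization,
    # which protects the salient weights from rounding error.
    s = act_scale.clamp(min=1e-5) ** alpha
    s = s / (s.max() * s.min()).sqrt()           # normalize the scale range

    w_scaled = w * s                              # fold the scale into the weights
    # Symmetric per-output-channel round-to-nearest quantization.
    qmax = 2 ** (n_bits - 1) - 1
    step = w_scaled.abs().amax(dim=1, keepdim=True) / qmax
    w_q = (w_scaled / step).round().clamp(-qmax - 1, qmax) * step
    return w_q / s                                # effective weight after undoing the scale

# Toy usage with random weights and calibration statistics.
w = torch.randn(64, 128)
act_scale = torch.rand(128)
w_q = awq_style_quantize(w, act_scale)
print((w - w_q).abs().mean())
```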
Quantization is a crucial technique for making large language models cheaper to deploy across diverse hardware platforms: it shrinks their memory footprint and bandwidth requirements while keeping accuracy close to that of the full-precision model.
FLUTE, a novel CUDA kernel, significantly accelerates LLM inference by providing fast matrix multiplications over lookup-table-quantized weights, and is particularly effective in low-bit and non-uniform quantization settings.
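To make the lookup-table idea concrete, here is a plain-PyTorch emulation of what such a kernel computes (FLUTE itself fuses the table lookup and the matmul inside a CUDA kernel); the function name `lut_matmul` and the per-row codebook layout are illustrative assumptions, not FLUTE's actual data format.

```python
import torch

def lut_matmul(x, w_idx, lut):
    """Emulate a lookup-table-quantized matmul in plain PyTorch (not the FLUTE kernel).

    x:     [batch, in_features] activations
    w_idx: [out_features, in_features] integer codes in [0, 2**bits)
    lut:   [out_features, 2**bits] per-row table of float values (non-uniform levels)
    """
    # Dequantize by gathering each weight's float value from its row's table,
    # then run an ordinary dense matmul. FLUTE fuses these two steps on the GPU.
    w_deq = torch.gather(lut, 1, w_idx.long())
    return x @ w_deq.T

# Toy usage: 3-bit non-uniform codebook per output row.
bits, out_f, in_f = 3, 16, 32
lut = torch.sort(torch.randn(out_f, 2 ** bits), dim=1).values
w_idx = torch.randint(0, 2 ** bits, (out_f, in_f))
y = lut_matmul(torch.randn(4, in_f), w_idx, lut)
print(y.shape)   # torch.Size([4, 16])
```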
PrefixQuant, a new static quantization technique for Large Language Models (LLMs), outperforms per-token dynamic quantization by prefixing outlier tokens in the KV cache, which removes activation outliers from the remaining tokens and improves both efficiency and accuracy.
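A toy sketch of why isolating outlier tokens helps static quantization follows; the token positions, magnitudes, and helper names are invented for illustration and do not reproduce PrefixQuant's actual outlier selection or KV-cache handling.

```python
import torch

def static_quant_params(calib_acts, n_bits=8):
    """One precomputed scale/zero-point pair, as static per-tensor quantization uses."""
    lo, hi = calib_acts.min(), calib_acts.max()
    scale = (hi - lo) / (2 ** n_bits - 1)
    zero = (-lo / scale).round()
    return scale, zero

def quantize(x, scale, zero, n_bits=8):
    return ((x / scale + zero).round().clamp(0, 2 ** n_bits - 1) - zero) * scale

# A few "outlier tokens" dominate the activation range. If they are isolated up
# front (and, in the method, kept in the KV cache), the static range fitted on
# the remaining tokens is much tighter. Positions and magnitudes are made up.
acts = torch.randn(128, 64)
acts[:2] *= 50.0                          # first two tokens act as outliers

scale_all, zero_all = static_quant_params(acts)          # range polluted by outliers
scale_pfx, zero_pfx = static_quant_params(acts[2:])      # outliers excluded (prefixed)

err_all = (acts[2:] - quantize(acts[2:], scale_all, zero_all)).abs().mean()
err_pfx = (acts[2:] - quantize(acts[2:], scale_pfx, zero_pfx)).abs().mean()
print(err_all, err_pfx)                   # prefixing yields a much smaller error
```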
SpinQuant is a novel method that leverages learned rotation matrices to minimize quantization errors in Large Language Models (LLMs), leading to significant improvements in accuracy and efficiency for low-bit quantization.
SpinQuant applies learned rotations to an LLM's weights and activations to reduce outliers, and through this achieves performance close to the full-precision model even under 4-bit quantization.
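The sketch below shows the computational-invariance trick behind rotation-based quantization: multiplying activations by an orthogonal matrix R and weights by Rᵀ leaves the product unchanged while spreading outlier energy across channels. A random QR-based rotation is used purely for illustration; SpinQuant learns its rotation matrices, and the dimensions and outlier pattern here are made up.

```python
import torch

def random_rotation(dim, seed=0):
    """Random orthogonal matrix via QR; SpinQuant instead *learns* the rotation,
    but a random one already illustrates the mechanics."""
    g = torch.Generator().manual_seed(seed)
    q, _ = torch.linalg.qr(torch.randn(dim, dim, generator=g))
    return q

def quantize_sym(t, n_bits=4):
    """Symmetric per-tensor round-to-nearest quantization."""
    qmax = 2 ** (n_bits - 1) - 1
    step = t.abs().max() / qmax
    return (t / step).round().clamp(-qmax - 1, qmax) * step

dim = 64
x = torch.randn(8, dim)
x[:, 0] *= 30.0                    # a channel-wise outlier, invented for this toy
w = torch.randn(dim, dim)
r = random_rotation(dim)

# Rotation is computation-invariant: (x R)(Rᵀ w) == x w, but x R spreads the
# outlier energy across channels, so 4-bit quantization typically loses far less.
y_ref = x @ w
y_plain = quantize_sym(x) @ quantize_sym(w)
y_rot = quantize_sym(x @ r) @ quantize_sym(r.T @ w)
print((y_ref - y_plain).abs().mean(), (y_ref - y_rot).abs().mean())
```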