Outlier-Free 4-Bit Inference in Rotated Large Language Models
QuaRot is a new quantization scheme that rotates large language models to remove outliers from the hidden state, enabling end-to-end 4-bit quantization of weights, activations, and the KV cache without keeping any channels in high precision.
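
To make the idea concrete, the following is a minimal NumPy sketch of why a rotation can help, not QuaRot's actual implementation. It assumes a toy hidden state with one outlier channel, a normalized Hadamard matrix as the orthogonal rotation, and a simple symmetric per-row round-to-nearest 4-bit quantizer. The names `hadamard` and `quantize_int4`, the dimensions, and the outlier value are all hypothetical choices made for illustration.

```python
import numpy as np

def hadamard(n):
    """Normalized Hadamard matrix (Sylvester construction); n must be a power of 2."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # scaling makes the matrix orthogonal

def quantize_int4(x):
    """Symmetric per-row 4-bit fake quantization; returns dequantized values."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round(x / scale), -8, 7)   # signed 4-bit integer range
    return q * scale

rng = np.random.default_rng(0)
d = 64
x = rng.standard_normal((8, d))   # toy hidden states (8 tokens)
x[:, 3] = 20.0                    # assumed outlier channel
W = rng.standard_normal((d, d))   # toy weight matrix

Q = hadamard(d)
x_rot = x @ Q        # rotated hidden state
W_rot = Q.T @ W      # rotation folded into the weights

# The rotation cancels, so the full-precision output is unchanged.
y_ref = x @ W
y_rot = x_rot @ W_rot

# Relative 4-bit quantization error of the activations, before and after rotation.
err_plain = np.linalg.norm(quantize_int4(x) - x) / np.linalg.norm(x)
err_rot = np.linalg.norm(quantize_int4(x_rot) - x_rot) / np.linalg.norm(x_rot)

print("outputs match:", np.allclose(y_ref, y_rot))
print("relative error without rotation:", err_plain)
print("relative error with rotation:   ", err_rot)
```

Because `Q` is orthogonal, `(x @ Q) @ (Q.T @ W)` equals `x @ W` up to floating-point error, so the rotation itself costs no accuracy. In this toy setup the outlier's energy is spread across all channels after rotation, so the per-row quantization scale shrinks and the 4-bit error of the rotated activations should come out smaller than that of the original ones.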