Key concepts
QuaRot is a new quantization scheme that rotates large language models to remove outliers from the hidden state, enabling end-to-end 4-bit quantization of weights, activations, and the KV cache without keeping any channels in higher precision.
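A minimal NumPy sketch of the invariance behind this idea (not the authors' implementation; the sizes and the SciPy Hadamard helper are illustrative assumptions): multiplying the hidden state by an orthogonal rotation Q and folding Q^T into the next weight matrix leaves the layer output unchanged, which is why the rotation can be applied without altering the model.

```python
import numpy as np
from scipy.linalg import hadamard

d = 8                                        # hidden size; Hadamard needs a power of 2
rng = np.random.default_rng(0)

H = hadamard(d) / np.sqrt(d)                 # orthonormal Hadamard matrix
Q = H * rng.choice([-1.0, 1.0], size=d)      # randomized Hadamard: H @ diag(random signs)

x = rng.normal(size=(4, d))                  # a batch of hidden states
W = rng.normal(size=(d, d))                  # a linear-layer weight

y_plain = x @ W
y_rotated = (x @ Q) @ (Q.T @ W)              # rotate the hidden state, fold Q^T into W

print(np.allclose(y_plain, y_rotated))       # True: the output is unchanged
```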
Abstract
The paper introduces QuaRot, a new quantization scheme for large language models (LLMs) that enables end-to-end 4-bit quantization of all weights, activations, and KV cache.
Key highlights:
- QuaRot applies randomized Hadamard transformations to the weight matrices of the LLM, removing outliers from the hidden state without changing the model output and thereby making quantization easier (see the sketch after this list).
- The Hadamard transformations are applied to the hidden state (residual) of the LLM, the activations of the feed-forward components, aspects of the attention mechanism, and the KV cache.
- This allows all matrix multiplications to be performed in 4 bits, without identifying any channels to retain in higher precision.
- On the LLAMA2-70B model, QuaRot achieves up to 2.16x prefill speedups, 3.39x memory savings during decoding, and at most 0.29 WikiText-2 perplexity loss, while retaining 99% of the zero-shot performance.
- QuaRot also extends to 6-bit and 8-bit quantization, where the results are lossless.
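Illustrative sketch of why the rotation helps quantization (my own toy example, not from the paper; symmetric per-tensor INT4 rounding and a single artificial outlier channel are assumptions): spreading the outlier's energy across all channels shrinks the quantization step and hence the error.

```python
import numpy as np
from scipy.linalg import hadamard

def quantize_int4(v):
    """Symmetric per-tensor 4-bit quantization: round to integer levels in [-7, 7]."""
    scale = np.abs(v).max() / 7.0
    q = np.clip(np.round(v / scale), -7, 7)
    return q * scale                           # dequantized values

d = 64
rng = np.random.default_rng(0)
x = rng.normal(size=d)
x[0] = 50.0                                    # one outlier channel dominates the range

H = hadamard(d) / np.sqrt(d)
Q = H * rng.choice([-1.0, 1.0], size=d)        # randomized Hadamard rotation

err_plain = np.abs(x - quantize_int4(x)).mean()
x_rot = x @ Q                                  # outlier energy is spread over all channels
err_rot = np.abs(x_rot - quantize_int4(x_rot)).mean()

print(f"mean abs error, no rotation:   {err_plain:.3f}")
print(f"mean abs error, with rotation: {err_rot:.3f}")   # noticeably smaller
```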
Statistics
Beyond the headline figures listed above, the paper's quantitative results are presented in tables showing the performance of QuaRot on various metrics rather than as isolated data points.
Quotes
The paper does not contain any striking quotes that support the key arguments.