
Efficient Quantization of Large Language Models with Outliers Using Block Floating Point Formats

Core Concepts
A novel approach that enables the use of low-precision block floating point formats without compromising model accuracy, by exploiting the channel-wise patterns commonly exhibited by outliers in weights and activations.
The paper addresses efficient inference for extremely large language models (LLMs). The key challenge is the enormous demand for compute and memory movement, which dedicated hardware struggles to serve quickly and efficiently, especially given the exploding growth in the lengths of the sequences being processed. To address this, the authors propose a novel approach that enables the use of low-precision block floating point (BFP) formats without compromising model accuracy.

The key observation is that an inner product is invariant to a synchronized reshuffling of the tensors being multiplied. The authors exploit the channel-wise patterns commonly exhibited by outliers in weights and activations to rearrange them so that their quantization quality improves significantly. Specifically, they sort the rows of the key weight matrix Wk by their Euclidean norms before quantization. This ensures that the elements within each block have comparable magnitudes, so a single outlier no longer degrades the quantization accuracy of the other elements sharing its block. To compensate for this reshuffling, the columns of the query weight matrix Wq are rearranged in the same order. The permutation happens at compile time and has no impact on inference latency.

The authors demonstrate the effectiveness of their approach on the Llama2-7B model, showing that their K-sort algorithm combined with BFP12 storage allows a 2x reduction in the memory footprint of the K-cache without significant degradation of the model's accuracy.
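The inner-product invariance that K-sort relies on can be sketched in a few lines of NumPy. The shapes and names below are illustrative assumptions, not the paper's exact layout: sorting the key channels by magnitude leaves the attention score unchanged as long as the query channels are permuted in the same order.

```python
import numpy as np

# Toy demonstration of the invariance behind K-sort: an inner product is
# unchanged by a synchronized permutation of both operands, so key channels
# can be sorted into a quantization-friendly order as long as the matching
# query channels are reshuffled identically (done once, at compile time).
rng = np.random.default_rng(0)
d_head = 8
q = rng.normal(size=d_head)
k = rng.normal(size=d_head)
k[[2, 6]] *= 50.0                      # a couple of outlier channels

perm = np.argsort(np.abs(k))           # sort key channels by magnitude
score_orig = q @ k                     # attention logit before reshuffling
score_perm = q[perm] @ k[perm]         # synchronized reshuffle
assert np.isclose(score_orig, score_perm)
```

In the paper this permutation is baked into the weights once (rows of Wk, matching dimension of Wq), so every generated key and query already comes out in sorted channel order at no runtime cost.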
The paper reports the following key figures for Llama2-7B on the wikitext-2 dataset:

- Baseline perplexity (FP16): 9.4881
- Perplexity with BFP12 quantization of keys and BFP16 quantization of queries:
  - Block size 128: 10.0861
  - Block size 64: 9.6061
  - Block size 32: 9.5196
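The trend in those numbers (smaller blocks handle outlier-heavy tensors better) can be reproduced with a toy shared-exponent quantizer. The sketch below is a simplification of block floating point for intuition only, not the paper's exact BFP12/BFP16 packing.

```python
import numpy as np

def bfp_quantize(x, block_size, mantissa_bits):
    """Toy block floating point: each block shares one power-of-two scale
    derived from its largest magnitude, and elements are rounded to signed
    integers with `mantissa_bits` bits. Simplified for intuition only."""
    blocks = x.reshape(-1, block_size)
    exp = np.ceil(np.log2(np.max(np.abs(blocks), axis=1, keepdims=True) + 1e-30))
    scale = 2.0 ** (exp - (mantissa_bits - 1))
    lim = 2 ** (mantissa_bits - 1)
    q = np.clip(np.round(blocks / scale), -lim, lim - 1)
    return (q * scale).ravel()

rng = np.random.default_rng(0)
x = rng.normal(size=1024)
x[::128] *= 100.0                      # channel-like outliers, one per 128 elements
for bs in (128, 64, 32):
    mse = np.mean((x - bfp_quantize(x, bs, mantissa_bits=4)) ** 2)
    print(f"block size {bs:3d}: MSE {mse:.4f}")
```

With block size 128 every block contains an outlier that stretches the shared scale, while at block size 32 only a quarter of the blocks are affected, so the reconstruction error drops as blocks shrink, mirroring the perplexity trend above.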

Key Insights Distilled From

by Nikita Trukh... at 04-01-2024
Accurate Block Quantization in LLMs with Outliers

Deeper Inquiries

How would the proposed K-sort algorithm perform on even larger LLM models, such as GPT-4 or Gemma, in terms of memory savings and accuracy preservation?

The K-sort algorithm, which rearranges the rows of the keys matrix K in Large Language Models (LLMs) to improve quantization quality and reduce the memory footprint, should transfer well to even larger models such as GPT-4 or Gemma. Larger models have more layers and heads and therefore larger KV-caches, so the absolute memory savings grow with model size. Sorting the rows of the keys matrix by their norms groups elements of comparable magnitude into the same blocks, containing the effect of outliers and preserving quantization accuracy. This would allow longer sequences to be processed on the same hardware without compromising model accuracy.

Because the channels are rearranged at compile time, the algorithm introduces no runtime overhead and can be applied to larger models without modification, yielding comparable benefits in memory efficiency and inference performance. That said, the accuracy impact would still need to be validated empirically, since outlier patterns can differ across model families.
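To make the memory argument concrete, here is a back-of-the-envelope KV-cache calculator using Llama2-7B-like dimensions (32 layers, 32 heads, head dimension 128). The helper name and the "effective bits per element" simplification are ours, and per-block shared-exponent overhead is ignored.

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, bits_per_elem):
    """Rough KV-cache footprint (keys and values together), ignoring any
    per-block shared-exponent overhead of block floating point formats."""
    return layers * heads * head_dim * seq_len * 2 * bits_per_elem / 8

# Llama2-7B-like shape at a 4096-token context, for various effective widths.
for bits in (16, 12, 8):
    gib = kv_cache_bytes(32, 32, 128, seq_len=4096, bits_per_elem=bits) / 2**30
    print(f"{bits:2d} effective bits/elem: {gib:.2f} GiB")  # 16 bits -> 2.00 GiB
```

Larger models multiply the per-token cost through more layers, more heads, or wider heads, which is why cutting the effective bits per element matters increasingly as models and context lengths grow.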

What other techniques could be combined with the K-sort approach to further improve the quantization quality and enable even higher compression ratios of the KV-cache?

To enhance the quantization quality and achieve higher compression ratios of the KV-cache in conjunction with the K-sort approach, several techniques can be integrated:

- Dynamic quantization: adjusting the precision of elements based on their importance or contribution to the model's performance can further optimize the quantization process.
- Sparsity techniques: exploiting sparsity in the matrices to reduce the number of stored non-zero elements leads to more efficient storage and computation, complementing the K-sort algorithm's memory-saving capabilities.
- Quantization-aware training: training the model with quantization in the loop optimizes it for low-precision inference, improving overall quantization quality and enabling higher compression ratios.
- Weight sharing: representing similar or redundant weights by a single shared value can further compress the KV-cache and reduce memory requirements.

By combining these techniques with the K-sort algorithm, it is possible to achieve a synergistic effect that maximizes memory savings, improves quantization accuracy, and enables even higher compression ratios of the KV-cache in large language models.
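As one concrete illustration of the weight-sharing idea, the sketch below maps weights onto a 16-entry codebook using uniform bins, so each weight is stored as a 4-bit index. A real system would typically use k-means clustering and fine-tuning; all names here are illustrative.

```python
import numpy as np

def share_weights(w, n_codes=16):
    """Map each weight to the centroid of its uniform bin and store only
    small integer indices plus a codebook. Uniform binning is a crude
    stand-in for k-means clustering."""
    edges = np.linspace(w.min(), w.max(), n_codes + 1)
    idx = np.clip(np.digitize(w, edges) - 1, 0, n_codes - 1)   # 4-bit indices
    codebook = np.array([w[idx == i].mean() if np.any(idx == i) else 0.0
                         for i in range(n_codes)])
    return codebook[idx], idx, codebook

rng = np.random.default_rng(0)
w = rng.normal(size=256)
recon, idx, codebook = share_weights(w)
print("max reconstruction error:", np.max(np.abs(w - recon)))
```

Since each element is replaced by the mean of its own bin, the reconstruction error is bounded by the bin width, and the storage cost drops from 16 bits to roughly 4 bits per element plus the tiny codebook.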

Could the channel-wise sorting idea be extended to the quantization of the values V in the KV-cache, and how would that impact the overall inference performance?

Yes, the channel-wise sorting idea can be extended to the values V in the KV-cache. Sorting the channels of V by a characteristic such as their norms before quantization groups elements of comparable magnitude into the same blocks, improving quantization accuracy and reducing the impact of outliers, just as K-sort does for the keys. This enables more compact storage of V and contributes to better memory utilization and inference speed.

As with the keys, the permutation must be compensated elsewhere so the attention output is unchanged; for the values, the natural place is the subsequent output projection, whose corresponding dimension can be rearranged in the same order. Because this rearrangement happens at compile time, it adds no runtime overhead, so higher compression ratios of the KV-cache can be achieved without adversely affecting inference performance or model accuracy.
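A minimal sketch of how a V-side permutation could be compensated, assuming the standard attention layout where the softmax weights multiply V and the result then passes through an output projection (variable names are ours): permuting the channels of V is exactly undone by permuting the rows of the output projection.

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d_head, d_model = 6, 8, 16
A = rng.random(size=(seq, seq))
A /= A.sum(axis=1, keepdims=True)        # stand-in for softmax attention weights
V = rng.normal(size=(seq, d_head))
Wo = rng.normal(size=(d_head, d_model))  # output projection

perm = np.argsort(np.linalg.norm(V, axis=0))  # sort V channels by column norm
out_orig = (A @ V) @ Wo
out_perm = (A @ V[:, perm]) @ Wo[perm, :]     # compensate in the output projection
assert np.allclose(out_orig, out_perm)
```

Both permutations can be fixed at compile time, so, as with K-sort, the reshuffling itself adds nothing to inference latency.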