Core Concepts

This paper presents a framework for compressed, user-defined numerical data types with variable precision and variable range. The framework adds flexibility to neural network computation and significantly reduces the bandwidth and storage needed for the weights of large language models and convolutional neural networks.

Abstract

The paper starts by observing that the weight distributions of large language models such as Llama2 7B contain many small-magnitude weights and only a few large ones. This skew suggests that the weights are highly compressible with entropy coding.
The paper then introduces a simple lossless compression algorithm based on this observation, which achieves roughly 1.5:1 compression on the Llama2 7B weights. The algorithm handles the fields of each bfloat16 weight separately: the exponent, whose distribution is highly skewed, is entropy coded, while the sign and mantissa bits are stored as they are.
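To make the compressibility argument concrete, the sketch below measures the empirical entropy of the exponent and mantissa fields of bfloat16 values and derives a rough bound on the compression ratio. It is only an illustration: the Gaussian samples are an assumed stand-in for real model weights, the float32 truncation is a simplified bfloat16 conversion, and the helper names are hypothetical rather than taken from the paper.

```python
import math
import random
import struct
from collections import Counter


def to_bfloat16_bits(x: float) -> int:
    """Truncate a float32 to its top 16 bits (a simple bfloat16 conversion)."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16


def entropy_bits(symbols) -> float:
    """Empirical Shannon entropy, in bits per symbol."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


random.seed(0)
# Assumption: Gaussian samples stand in for real model weights.
weights = [random.gauss(0.0, 0.02) for _ in range(50_000)]
patterns = [to_bfloat16_bits(w) for w in weights]

exponent_entropy = entropy_bits((p >> 7) & 0xFF for p in patterns)  # 8-bit field
mantissa_entropy = entropy_bits(p & 0x7F for p in patterns)         # 7-bit field

# Rough bound on coded size: 1 sign bit plus the entropy of each field.
bits_per_weight = 1 + exponent_entropy + mantissa_entropy
ratio = 16 / bits_per_weight

print(f"exponent entropy: {exponent_entropy:.2f} of 8 bits")
print(f"mantissa entropy: {mantissa_entropy:.2f} of 7 bits")
print(f"estimated compression ratio: {ratio:.2f}:1")
```

In this synthetic setting nearly all of the saving comes from the exponent field, because the mantissa bits are close to uniformly distributed.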
The paper then extends this concept into a more general framework for variable precision, variable range, compressed numerical data types. The framework is built on "coding pairs", each consisting of a code, which is entropy coded, and additional data of variable size. This allows the representation of a wide range of data types, including custom floating-point formats, posits, and quantized integers.
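As a rough illustration of the idea of a code plus variable-size additional data, the sketch below maps a signed integer to such a pair. The layout is an assumption made for this example (the code is the bit length of the magnitude, and the additional data holds the sign and the bits below the leading one); `CodingPair`, `int_to_pair`, and `pair_to_int` are hypothetical names, not the paper's definitions.

```python
from typing import NamedTuple


class CodingPair(NamedTuple):
    code: int        # symbol that would be handed to the entropy coder
    extra: int       # additional data, stored verbatim
    extra_bits: int  # width of the additional data, implied by the code


def int_to_pair(value: int) -> CodingPair:
    """Split a signed integer into a magnitude class and raw bits."""
    if value == 0:
        return CodingPair(0, 0, 0)
    magnitude = abs(value)
    n = magnitude.bit_length()           # code: number of magnitude bits
    sign = 1 if value < 0 else 0
    low = magnitude - (1 << (n - 1))     # bits below the implicit leading 1
    extra = (sign << (n - 1)) | low      # 1 sign bit + (n - 1) low bits
    return CodingPair(n, extra, n)


def pair_to_int(pair: CodingPair) -> int:
    """Inverse of int_to_pair."""
    if pair.code == 0:
        return 0
    n = pair.code
    sign = pair.extra >> (n - 1)
    low = pair.extra & ((1 << (n - 1)) - 1)
    magnitude = (1 << (n - 1)) | low
    return -magnitude if sign else magnitude


for v in (0, 5, -1, 300):
    pair = int_to_pair(v)
    print(v, pair, pair_to_int(pair))
```

Small values get short codes with little additional data, so when small values dominate, the entropy coder spends few bits on the common codes while the raw bits grow only for the rare large values.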
The paper discusses a hardware implementation of this framework using Asymmetric Numeral Systems (ANS), which can compress and decompress a coding pair in a single clock cycle. It also describes how this compressed data format can be efficiently interfaced with both floating-point and fixed-point computational engines.
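For readers unfamiliar with ANS, the following is a minimal software sketch of the range variant (rANS). It uses an unbounded Python integer as the coder state, so it omits the fixed-width state and renormalization that a hardware design would need, and it is not a description of the paper's circuit. The frequency table and function names are assumptions for illustration.

```python
def build_tables(freqs):
    """Return cumulative start positions per symbol and the total frequency."""
    cum, total = {}, 0
    for sym in sorted(freqs):
        cum[sym] = total
        total += freqs[sym]
    return cum, total


def rans_encode(symbols, freqs):
    """Fold a symbol sequence into a single integer state."""
    cum, total = build_tables(freqs)
    state = 1
    for s in reversed(symbols):  # encode backwards so decoding runs forwards
        f = freqs[s]
        state = (state // f) * total + cum[s] + (state % f)
    return state


def rans_decode(state, count, freqs):
    """Recover `count` symbols from the integer state."""
    cum, total = build_tables(freqs)
    out = []
    for _ in range(count):
        slot = state % total
        s = next(sym for sym in cum if cum[sym] <= slot < cum[sym] + freqs[sym])
        state = freqs[s] * (state // total) + slot - cum[s]
        out.append(s)
    return out


# Hypothetical skewed code frequencies (they sum to 16).
freqs = {0: 9, 1: 4, 2: 2, 3: 1}
message = [0, 0, 1, 0, 2, 0, 1, 0, 0, 3, 0, 1]
state = rans_encode(message, freqs)
decoded = rans_decode(state, len(message), freqs)
print(state.bit_length(), "bits of state for", len(message), "symbols")
print(decoded == message)
```

Frequent symbols grow the state by less than rare ones, which is how the coder approaches the entropy of the code distribution.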
Finally, the paper presents an example of using this compression technique for weight sharing in a "token factory" architecture, where multiple instances of the same large language model can share the compressed weight data, significantly reducing the required memory bandwidth.
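The bandwidth effect of sharing can be estimated with simple arithmetic. The sketch below compares the DRAM traffic of one pass over the weights when every instance fetches its own uncompressed copy against a single shared compressed stream. The figures are assumptions for illustration: a nominal 7 billion parameters, 16-bit weights, the 1.5:1 ratio quoted above, and a hypothetical count of four instances.

```python
def weight_traffic_gb(params, bits_per_weight, instances, shared):
    """DRAM traffic in GB for one pass over the weights by all instances."""
    one_pass = params * bits_per_weight / 8 / 1e9
    # A shared stream is fetched once and fanned out to every instance.
    return one_pass if shared else one_pass * instances


PARAMS = 7e9      # assumption: nominal parameter count
INSTANCES = 4     # assumption: hypothetical number of model instances

baseline = weight_traffic_gb(PARAMS, 16, INSTANCES, shared=False)
shared_compressed = weight_traffic_gb(PARAMS, 16 / 1.5, INSTANCES, shared=True)

print(f"separate uncompressed fetches: {baseline:.1f} GB per pass")
print(f"one shared compressed stream:  {shared_compressed:.1f} GB per pass")
```

Under these assumptions the saving has two independent factors: the compression ratio and the number of instances served by the same stream.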

Stats

The weights of the Llama2 7B model consist of many small-magnitude values and only a few large ones.
The simple lossless compression algorithm can achieve around 1.5:1 compression for the Llama2 7B weights.
The ANS-based hardware implementation can process over 800 million bfloat16 numbers per second.

Quotes

"This paper attempts to address and reconcile two different issues: the existence of multiple numerical data formats (such as int8, bfloat16, fp8, etc., often non optimal for the application and not directly compatible with one another) and the necessity to reduce their bandwidth requirements, especially in the case of power hungry and slow DRAM."
"One pattern that should immediately stand out is the large amount a small weights compared to a few large ones (in absolute value). This pattern indicates that the weights are highly compressible."
"Except for the reduced bandwidth and storage space, this is analogous to memory interface that fetches multiple streams of data to feed computational structures such as tensor cores and systolic arrays. The only difference is that the FIFOs and the memory will contain compressed data."

Key Insights Distilled From

by Vincenzo Lig... at **arxiv.org** 04-18-2024

Deeper Inquiries

Extending the compression framework to activations, whose statistical properties change during runtime, requires collecting and updating code frequencies dynamically. A mechanism would monitor the activations being processed, gather statistics on the fly, generate normalized probabilities, and reprogram the compressor to reflect the current data distribution. This adds complexity and overhead to the compression process, but it lets the compressor encode activations efficiently even as their statistics fluctuate.
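One way to picture the collect, normalize, and reprogram steps is the sketch below, which keeps a sliding window of recent codes and converts the counts into an integer frequency table with a power-of-two total, as an ANS-style coder would typically expect. The window size, table precision, and the names `normalize_freqs` and `SlidingModel` are assumptions for illustration, not part of the paper.

```python
from collections import Counter, deque


def normalize_freqs(counts, total_bits=8):
    """Scale counts to integers summing to 2**total_bits, each at least 1."""
    target = 1 << total_bits
    if not counts or len(counts) > target:
        raise ValueError("need between 1 and 2**total_bits distinct codes")
    n = sum(counts.values())
    scaled = {s: max(1, c * target // n) for s, c in counts.items()}
    diff = target - sum(scaled.values())
    while diff != 0:
        s = max(scaled, key=scaled.get)   # absorb drift in the largest entry
        step = diff if diff > 0 else -1
        scaled[s] += step
        diff -= step
    return scaled


class SlidingModel:
    """Tracks code statistics over the most recent `window` codes."""

    def __init__(self, window=1024, total_bits=8):
        self.recent = deque(maxlen=window)
        self.total_bits = total_bits

    def observe(self, codes):
        self.recent.extend(codes)

    def table(self):
        return normalize_freqs(Counter(self.recent), self.total_bits)


model = SlidingModel(window=100)
model.observe([0] * 90 + [1] * 10)
first_table = model.table()
model.observe([1] * 100)          # the distribution shifts completely
second_table = model.table()
print(first_table, second_table)
```

A decoder has to apply the same table updates at the same points in the stream, which is part of the overhead mentioned above.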

Applying the compression technique to the weights and activations of other types of neural networks, such as convolutional neural networks (CNNs) or recurrent neural networks (RNNs), presents both challenges and trade-offs.
Challenges:
Data Distribution: CNNs and RNNs may have different weight and activation distributions compared to large language models (LLMs), requiring tailored compression strategies.
Complexity: CNNs involve multi-dimensional weight tensors and spatial correlations, while RNNs have sequential dependencies, posing challenges for compression algorithms.
Adaptability: Ensuring the compression framework can adapt to the unique characteristics of each network type without sacrificing performance.
Trade-offs:
Resource Utilization: Different network architectures may require varying levels of resources for compression, impacting hardware implementation.
Compression Ratios: The achievable compression ratio depends on the data distribution in CNNs and RNNs. Lossless coding leaves accuracy unchanged, so a trade-off between compression ratio and accuracy only arises when reduced-precision data types are used.
Performance: Balancing the trade-off between compressed data size and inference speed, especially in real-time applications where latency is critical.
By addressing these challenges and trade-offs through customized compression algorithms and hardware implementations, the framework can be extended to effectively compress weights and activations in diverse neural network architectures.

Integrating the compression framework with emerging hardware architectures such as in-memory computing or neuromorphic computing could further improve the performance and energy efficiency of large language models (LLMs) and other AI applications.
In-Memory Computing:
Efficient Data Access: Leveraging in-memory computing's ability to perform computations within memory units can enhance the speed of compression and decompression operations.
Reduced Data Movement: By processing compressed data directly within memory, the need for frequent data transfers between memory and processing units is minimized, improving overall system efficiency.
Parallel Processing: In-memory computing architectures can support parallel processing of compressed data streams, enabling faster compression and decompression of weights and activations.
Neuromorphic Computing:
Spiking Neural Networks: Neuromorphic computing mimics the brain's neural structure, offering potential for efficient encoding and decoding of compressed data using spiking neural networks.
Event-Driven Processing: Neuromorphic hardware's event-driven processing can be leveraged for real-time adaptive compression of activations based on changing statistical properties.
Low-Power Operation: Neuromorphic architectures are inherently energy-efficient, reducing power consumption during compression and decompression tasks for neural networks.
Integrating the compression framework with these hardware architectures could therefore increase the speed and reduce the energy consumption of processing large neural network models.
