FlattenQuant achieves low-bit per-tensor quantization for large language models by flattening channels that contain large outliers: an outlier channel is expanded into several sub-channels, which lowers the tensor's maximum value so that a single per-tensor scale quantizes it with little clipping error. This targets the compute-bound stage of LLM inference and reduces memory consumption, delivering faster inference with minimal accuracy loss.
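The following is a minimal sketch of the flattening idea, not the paper's implementation: the function names (`flatten_channels`, `quantize_per_tensor`), the fixed `threshold`, and the equal-split strategy are all illustrative assumptions. The key invariant is that splitting an activation channel into `n` equal sub-channels and repeating the matching weight row `n` times leaves the matmul result unchanged while shrinking the activation's per-tensor maximum.

```python
import torch

def flatten_channels(x, w, threshold):
    # x: (tokens, in_features) activations; w: (in_features, out_features) weights.
    # A channel whose absolute max exceeds `threshold` is split into n equal
    # sub-channels (each carrying value / n); the matching weight row is
    # repeated n times, so x_new @ w_new == x @ w but max(|x_new|) shrinks.
    cols_x, rows_w = [], []
    for c in range(x.shape[1]):
        cmax = x[:, c].abs().max()
        n = max(1, int(torch.ceil(cmax / threshold)))
        cols_x.append(x[:, c : c + 1].expand(-1, n) / n)
        rows_w.append(w[c : c + 1].expand(n, -1))
    return torch.cat(cols_x, dim=1), torch.cat(rows_w, dim=0)

def quantize_per_tensor(t, bits=4):
    # Symmetric per-tensor quantization: one scale for the whole tensor,
    # so a single outlier channel would otherwise dominate the scale.
    qmax = 2 ** (bits - 1) - 1
    scale = (t.abs().max() / qmax).clamp(min=1e-8)
    return torch.clamp(torch.round(t / scale), -qmax - 1, qmax), scale

# Toy check: one outlier channel, flattened before 4-bit quantization.
x = torch.randn(8, 16)
x[:, 3] *= 50.0                       # synthetic outlier channel
w = torch.randn(16, 32)
xf, wf = flatten_channels(x, w, threshold=4.0)
qx, sx = quantize_per_tensor(xf)
print((qx * sx) @ wf - x @ w)         # small residual: quantization error only
```

In this toy setup the expanded tensor stays exactly equivalent in the matmul; the only error left is the rounding error of the 4-bit quantizer, which the lowered maximum keeps small.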
The authors of IntactKV observe that attention scores concentrate heavily on a few pivot tokens (e.g., the [BOS] token), whose activations also carry extreme outliers. They therefore generate the KV cache of these pivot tokens losslessly with the full-precision model and keep it intact while the rest of the model runs quantized, which reduces quantization error and improves the accuracy of quantized LLMs.
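A minimal sketch of this inference pattern is shown below, assuming a Hugging Face-style causal LM interface where both models share the same architecture and tokenizer and return `past_key_values` when `use_cache=True`. The function name `intact_pivot_kv` and the `num_pivot` parameter are hypothetical; the real method also treats the intact cache as trainable calibration parameters, which this sketch omits.

```python
import torch

@torch.no_grad()
def intact_pivot_kv(fp_model, quant_model, input_ids, num_pivot=1):
    # 1) Run only the pivot tokens (e.g., [BOS]) through the FULL-PRECISION
    #    model so their KV cache is computed without quantization error.
    pivot_out = fp_model(input_ids[:, :num_pivot], use_cache=True)
    intact_cache = pivot_out.past_key_values

    # 2) Feed the remaining tokens to the QUANTIZED model, reusing the
    #    intact pivot-token cache instead of recomputing it with
    #    quantized weights.
    return quant_model(
        input_ids[:, num_pivot:],
        past_key_values=intact_cache,
        use_cache=True,
    )
```

Since the pivot tokens sit at the start of every sequence, their full-precision cache can be computed once offline and reused across requests, so the extra cost at serving time is negligible.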