Enabling Efficient Long-Context Inference for Large Language Models through Accurate Key-Value Cache Quantization
Efficient long-context inference for large language models via accurate key-value (KV) cache quantization, built on four techniques: per-channel Key quantization, pre-RoPE Key quantization, sensitivity-weighted non-uniform quantization, and per-vector dense-and-sparse quantization.
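
As a minimal sketch (not the authors' implementation), the NumPy snippet below illustrates two of these ideas: Keys are quantized per-channel (scales shared across the token axis, matching the channel-wise outlier structure of Keys), while Values are quantized per-token; per-vector dense-and-sparse quantization is approximated by keeping a small fraction of extreme entries per vector in full precision and quantizing the rest. All function names, the 4-bit width, and the 1% outlier fraction are illustrative assumptions.

```python
import numpy as np

def quantize_symmetric(x, axis, bits=4):
    """Uniform symmetric quantization with one scale shared along `axis`."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x), axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)           # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)  # e.g. [-8, 7] for 4 bits
    return q * scale                                   # dequantized view

def dense_and_sparse(x, axis, bits=4, outlier_frac=0.01):
    """Keep the largest-magnitude `outlier_frac` entries per vector in full
    precision (the 'sparse' part) and quantize the remainder (the 'dense' part).
    The threshold is computed per vector along `axis` (assumed setup)."""
    thresh = np.quantile(np.abs(x), 1.0 - outlier_frac, axis=axis, keepdims=True)
    sparse_mask = np.abs(x) >= thresh                  # per-vector outliers
    dense = np.where(sparse_mask, 0.0, x)              # zero out outliers
    deq = quantize_symmetric(dense, axis=axis, bits=bits)
    return np.where(sparse_mask, x, deq)               # outliers stay exact

rng = np.random.default_rng(0)
K = rng.normal(size=(128, 64))        # (tokens, channels)
V = rng.normal(size=(128, 64))

K_hat = dense_and_sparse(K, axis=0)   # Keys: scales per channel, across tokens
V_hat = dense_and_sparse(V, axis=1)   # Values: scales per token, across channels
print("Key RMSE:  ", np.sqrt(np.mean((K - K_hat) ** 2)))
print("Value RMSE:", np.sqrt(np.mean((V - V_hat) ** 2)))
```

This sketch uses uniform quantization for brevity; the sensitivity-weighted non-uniform codebooks and the pre-RoPE placement of Key quantization named above are separate refinements not shown here.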