KVQuant enables efficient long-context inference for large language models through accurate Key-Value cache quantization, introducing per-channel Key quantization, pre-RoPE Key quantization, sensitivity-weighted non-uniform quantization, and per-vector dense-and-sparse quantization.
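The intuition behind per-channel Key quantization is that Key activations tend to contain a few large-magnitude channels; computing one scale per channel keeps those outliers from inflating the quantization step for every token. Below is a minimal NumPy sketch (not the paper's code; the `quantize` helper and the synthetic outlier channel are illustrative) comparing per-token and per-channel scaling:

```python
import numpy as np

def quantize(x: np.ndarray, axis: int, bits: int = 4) -> np.ndarray:
    """Symmetric uniform quantization with one scale per slice along `axis`.

    axis=0 -> per-token scales (one scale per row),
    axis=1 -> per-channel scales (one scale per column).
    """
    qmax = 2 ** (bits - 1) - 1
    # Max magnitude along the *other* axis, keepdims for broadcasting.
    scale = np.abs(x).max(axis=1 - axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)           # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)  # integer codes
    return q * scale                                   # dequantized values

rng = np.random.default_rng(0)
keys = rng.normal(size=(128, 64)).astype(np.float32)   # (tokens, channels)
keys[:, 3] *= 50.0                                     # synthetic outlier channel

for name, axis in [("per-token", 0), ("per-channel", 1)]:
    err = np.abs(keys - quantize(keys, axis)).mean()
    print(f"{name:12s} mean abs error: {err:.4f}")
```

With the outlier channel present, the per-channel variant reports a much lower reconstruction error, because only that channel's scale grows while all other channels keep fine-grained steps.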
REST accelerates language model generation via speculative decoding, retrieving draft tokens from a datastore instead of generating them with a draft model, so it requires no additional training.
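REST builds its datastore as a suffix index over a large corpus and selects among retrieved continuations with a token trie; the sketch below is a deliberately simplified stand-in, assuming a toy dictionary of n-gram continuations, just to show the retrieval-as-drafting idea (`build_datastore` and `retrieve_draft` are hypothetical names, not the paper's API):

```python
from collections import defaultdict

def build_datastore(corpus_tokens, context_len=2, draft_len=3):
    """Map each length-`context_len` n-gram in the corpus to the
    continuations that followed it in the corpus."""
    store = defaultdict(list)
    for i in range(len(corpus_tokens) - context_len - draft_len + 1):
        key = tuple(corpus_tokens[i:i + context_len])
        store[key].append(corpus_tokens[i + context_len:i + context_len + draft_len])
    return store

def retrieve_draft(store, context, context_len=2):
    """Propose draft tokens by exact-matching the current context suffix.
    REST ranks candidates with a trie; we just take the first match here."""
    candidates = store.get(tuple(context[-context_len:]))
    return candidates[0] if candidates else []

corpus = "the cat sat on the mat and the cat ran".split()
store = build_datastore(corpus)
context = "yesterday the cat".split()
print(retrieve_draft(store, context))  # ['sat', 'on', 'the']
```

As in standard speculative decoding, the retrieved draft is then verified in a single forward pass of the target model, which accepts a prefix of it and rejects the rest.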
DistillSpec uses knowledge distillation to align a small draft model with a large target model, raising the draft acceptance rate and thus the speed of speculative decoding without compromising output quality.
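The core objective is to make the draft model's next-token distribution match the target model's, since speculative-decoding acceptance depends directly on that match. Here is a minimal PyTorch sketch of one variant of the objective, forward KL on a shared batch of tokens (the paper also studies other divergences and on-policy data generated by the draft model; tensor and function names are illustrative):

```python
import torch
import torch.nn.functional as F

def distillspec_loss(draft_logits: torch.Tensor,
                     target_logits: torch.Tensor) -> torch.Tensor:
    """Forward KL  D_KL(p_target || p_draft), averaged over positions.
    One of several divergences the method can minimize."""
    log_p_draft = F.log_softmax(draft_logits, dim=-1)
    p_target = F.softmax(target_logits, dim=-1)
    # kl_div expects log-probabilities as input and probabilities as target.
    return F.kl_div(log_p_draft, p_target, reduction="batchmean")

# Toy usage: (positions, vocab) logits from both models on the same tokens.
draft_logits = torch.randn(8, 32000, requires_grad=True)  # small draft model
target_logits = torch.randn(8, 32000)                     # large target model (frozen)
loss = distillspec_loss(draft_logits, target_logits)
loss.backward()  # gradients flow only into the draft model
print(float(loss))
```

Only the draft model is updated; the target model serves as a fixed teacher, so the verified output distribution, and hence generation quality, is unchanged.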