KVQuant enables efficient long-context inference for large language models through accurate Key-Value cache quantization, introducing per-channel Key quantization, pre-RoPE Key quantization, sensitivity-weighted non-uniform quantization, and per-vector dense-and-sparse quantization.
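The intuition behind per-channel Key quantization is that Key activations tend to contain a few large-magnitude channels; computing one scale per channel keeps those outliers from inflating the quantization step for every token. Below is a minimal NumPy sketch (not the paper's code; the `quantize` helper and the synthetic outlier channel are illustrative) comparing per-token and per-channel scaling:

```python
import numpy as np

def quantize(x: np.ndarray, axis: int, bits: int = 4) -> np.ndarray:
    """Symmetric uniform quantization with one scale per slice along `axis`.

    axis=0 -> per-token scales (one scale per row),
    axis=1 -> per-channel scales (one scale per column).
    """
    qmax = 2 ** (bits - 1) - 1
    # Max magnitude along the *other* axis, keepdims for broadcasting.
    scale = np.abs(x).max(axis=1 - axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)           # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)  # integer codes
    return q * scale                                   # dequantized values

rng = np.random.default_rng(0)
keys = rng.normal(size=(128, 64)).astype(np.float32)   # (tokens, channels)
keys[:, 3] *= 50.0                                     # synthetic outlier channel

for name, axis in [("per-token", 0), ("per-channel", 1)]:
    err = np.abs(keys - quantize(keys, axis)).mean()
    print(f"{name:12s} mean abs error: {err:.4f}")
```

With the outlier channel present, the per-channel variant reports a much lower reconstruction error, because only that channel's scale grows while all other channels keep fine-grained steps.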
REST accelerates language model generation via speculative decoding, retrieving draft tokens from a datastore instead of generating them with a draft model, so it requires no additional training.
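REST builds its datastore as a suffix index over a large corpus and selects among retrieved continuations with a token trie; the sketch below is a deliberately simplified stand-in, assuming a toy dictionary of n-gram continuations, just to show the retrieval-as-drafting idea (`build_datastore` and `retrieve_draft` are hypothetical names, not the paper's API):

```python
from collections import defaultdict

def build_datastore(corpus_tokens, context_len=2, draft_len=3):
    """Map each length-`context_len` n-gram in the corpus to the
    continuations that followed it in the corpus."""
    store = defaultdict(list)
    for i in range(len(corpus_tokens) - context_len - draft_len + 1):
        key = tuple(corpus_tokens[i:i + context_len])
        store[key].append(corpus_tokens[i + context_len:i + context_len + draft_len])
    return store

def retrieve_draft(store, context, context_len=2):
    """Propose draft tokens by exact-matching the current context suffix.
    REST ranks candidates with a trie; we just take the first match here."""
    candidates = store.get(tuple(context[-context_len:]))
    return candidates[0] if candidates else []

corpus = "the cat sat on the mat and the cat ran".split()
store = build_datastore(corpus)
context = "yesterday the cat".split()
print(retrieve_draft(store, context))  # ['sat', 'on', 'the']
```

As in standard speculative decoding, the retrieved draft is then verified in a single forward pass of the target model, which accepts a prefix of it and rejects the rest.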
DistillSpec uses knowledge distillation to align a small draft model with a large target model, raising the draft acceptance rate and thus the speed of speculative decoding without compromising output quality.
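The core objective is to make the draft model's next-token distribution match the target model's, since speculative-decoding acceptance depends directly on that match. Here is a minimal PyTorch sketch of one variant of the objective, forward KL on a shared batch of tokens (the paper also studies other divergences and on-policy data generated by the draft model; tensor and function names are illustrative):

```python
import torch
import torch.nn.functional as F

def distillspec_loss(draft_logits: torch.Tensor,
                     target_logits: torch.Tensor) -> torch.Tensor:
    """Forward KL  D_KL(p_target || p_draft), averaged over positions.
    One of several divergences the method can minimize."""
    log_p_draft = F.log_softmax(draft_logits, dim=-1)
    p_target = F.softmax(target_logits, dim=-1)
    # kl_div expects log-probabilities as input and probabilities as target.
    return F.kl_div(log_p_draft, p_target, reduction="batchmean")

# Toy usage: (positions, vocab) logits from both models on the same tokens.
draft_logits = torch.randn(8, 32000, requires_grad=True)  # small draft model
target_logits = torch.randn(8, 32000)                     # large target model (frozen)
loss = distillspec_loss(draft_logits, target_logits)
loss.backward()  # gradients flow only into the draft model
print(float(loss))
```

Only the draft model is updated; the target model serves as a fixed teacher, so the verified output distribution, and hence generation quality, is unchanged.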