MuxServe, a flexible spatial-temporal multiplexing system, colocates large language models according to their popularity and flexibly colocates prefill and decoding jobs, improving GPU utilization and serving multiple LLMs efficiently.
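The popularity-aware colocation idea can be sketched with a toy greedy placement heuristic. This is an illustration only, not MuxServe's actual algorithm, which also partitions GPU compute between prefill and decoding jobs; the model names and popularity weights below are made up.

```python
# Toy popularity-aware placement: assign each LLM to the GPU group with
# the least accumulated popularity, so hot and cold models are mixed
# rather than piling popular models onto the same hardware.
def place(models, num_gpus):
    gpus = [0.0] * num_gpus  # accumulated popularity per GPU group
    placement = {}
    # Place the most popular models first (greedy, largest-first).
    for name, popularity in sorted(models.items(), key=lambda kv: -kv[1]):
        g = min(range(num_gpus), key=lambda i: gpus[i])
        placement[name] = g
        gpus[g] += popularity
    return placement

models = {"llm-a": 0.6, "llm-b": 0.3, "llm-c": 0.1}
print(place(models, 2))  # → {'llm-a': 0, 'llm-b': 1, 'llm-c': 1}
```

Largest-first greedy balancing is a standard bin-packing heuristic; the real system additionally multiplexes SM resources within a GPU between colocated jobs.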
SpecInfer accelerates large language model serving through tree-based speculative inference and verification, significantly reducing both memory accesses to the LLM's parameters and end-to-end inference latency while preserving the same generative performance as incremental decoding.
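The verify step can be illustrated with a toy token-tree walk: a draft model proposes a tree of candidate continuations, and the target model accepts the branch matching its own greedy output. This is a simplified sketch; in SpecInfer the whole tree is verified in a single batched forward pass of the target LLM, and `target_next_token` below is a made-up stand-in for the target model.

```python
# Stand-in for the target LLM's greedy next-token function:
# here it simply follows a fixed "ground truth" sequence.
def target_next_token(prefix):
    sequence = [1, 2, 3, 4]
    return sequence[len(prefix)] if len(prefix) < len(sequence) else None

def verify_tree(prefix, tree):
    """Walk a draft token tree (dict: token -> subtree), accepting the
    branch that matches the target model's output token by token."""
    accepted = []
    node = tree
    while node:
        want = target_next_token(prefix + accepted)
        if want in node:
            accepted.append(want)   # draft token confirmed
            node = node[want]       # descend into that branch
        else:
            break                   # mismatch: stop accepting
    return accepted

# The draft model proposed two branches; only one matches the target.
draft_tree = {1: {2: {9: {}}}, 5: {}}
print(verify_tree([], draft_tree))  # → [1, 2]
```

Because several speculated tokens can be confirmed per target-model pass, the expensive model's weights are read far less often per generated token.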
AttentionStore, a hierarchical key-value caching system, enables the reuse of KV caches across multi-turn conversations, significantly reducing repetitive computation and improving the inference performance and cost-efficiency of large language models.
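The cross-turn reuse idea can be sketched with a toy per-conversation cache: each turn only computes KV for the new suffix of the context and reuses the rest. This is purely illustrative; the actual system persists KV caches in a hierarchical host-memory/disk store with prefetching and eviction, and the names below (`kv_store`, `serve_turn`) are invented for this sketch.

```python
kv_store = {}  # conversation_id -> (cached_tokens, cached_kv)

def compute_kv(tokens):
    # Stand-in for the attention KV computation over new tokens.
    return [f"kv({t})" for t in tokens]

def serve_turn(conv_id, full_context):
    """Reuse the cached KV prefix for this conversation and compute KV
    only for tokens appended since the last turn."""
    cached_tokens, cached_kv = kv_store.get(conv_id, ([], []))
    new_tokens = full_context[len(cached_tokens):]
    kv = cached_kv + compute_kv(new_tokens)
    kv_store[conv_id] = (list(full_context), kv)
    return len(new_tokens)  # number of tokens actually recomputed

print(serve_turn("c1", ["hi"]))           # → 1 (cold start)
print(serve_turn("c1", ["hi", "there"]))  # → 1 (prefix KV reused)
```

Without the cache, the second turn would recompute KV for the entire two-token history; with it, only the newly appended token is processed, which is where the savings grow over long conversations.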