MuxServe, a flexible spatial-temporal multiplexing system, colocates large language models according to their popularity and flexibly colocates prefill and decoding jobs, improving GPU utilization and serving multiple LLMs efficiently.
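The popularity-aware colocation idea can be sketched with a toy greedy placement heuristic. This is an illustration only, not MuxServe's actual algorithm, which also partitions GPU compute between prefill and decoding jobs; the model names and popularity weights below are made up.

```python
# Toy popularity-aware placement: assign each LLM to the GPU group with
# the least accumulated popularity, so hot and cold models are mixed
# rather than piling popular models onto the same hardware.
def place(models, num_gpus):
    gpus = [0.0] * num_gpus  # accumulated popularity per GPU group
    placement = {}
    # Place the most popular models first (greedy, largest-first).
    for name, popularity in sorted(models.items(), key=lambda kv: -kv[1]):
        g = min(range(num_gpus), key=lambda i: gpus[i])
        placement[name] = g
        gpus[g] += popularity
    return placement

models = {"llm-a": 0.6, "llm-b": 0.3, "llm-c": 0.1}
print(place(models, 2))  # → {'llm-a': 0, 'llm-b': 1, 'llm-c': 1}
```

Largest-first greedy balancing is a standard bin-packing heuristic; the real system additionally multiplexes SM resources within a GPU between colocated jobs.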
SpecInfer accelerates large language model serving through tree-based speculative inference and verification, significantly reducing both memory accesses to the LLM's parameters and end-to-end inference latency while preserving the same generative performance as incremental decoding.
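The verify step can be illustrated with a toy token-tree walk: a draft model proposes a tree of candidate continuations, and the target model accepts the branch matching its own greedy output. This is a simplified sketch; in SpecInfer the whole tree is verified in a single batched forward pass of the target LLM, and `target_next_token` below is a made-up stand-in for the target model.

```python
# Stand-in for the target LLM's greedy next-token function:
# here it simply follows a fixed "ground truth" sequence.
def target_next_token(prefix):
    sequence = [1, 2, 3, 4]
    return sequence[len(prefix)] if len(prefix) < len(sequence) else None

def verify_tree(prefix, tree):
    """Walk a draft token tree (dict: token -> subtree), accepting the
    branch that matches the target model's output token by token."""
    accepted = []
    node = tree
    while node:
        want = target_next_token(prefix + accepted)
        if want in node:
            accepted.append(want)   # draft token confirmed
            node = node[want]       # descend into that branch
        else:
            break                   # mismatch: stop accepting
    return accepted

# The draft model proposed two branches; only one matches the target.
draft_tree = {1: {2: {9: {}}}, 5: {}}
print(verify_tree([], draft_tree))  # → [1, 2]
```

Because several speculated tokens can be confirmed per target-model pass, the expensive model's weights are read far less often per generated token.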
AttentionStore, a hierarchical key-value caching system, enables the reuse of KV caches across multi-turn conversations, significantly reducing repetitive computation and improving the inference performance and cost-efficiency of large language models.
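The cross-turn reuse idea can be sketched with a toy per-conversation cache: each turn only computes KV for the new suffix of the context and reuses the rest. This is purely illustrative; the actual system persists KV caches in a hierarchical host-memory/disk store with prefetching and eviction, and the names below (`kv_store`, `serve_turn`) are invented for this sketch.

```python
kv_store = {}  # conversation_id -> (cached_tokens, cached_kv)

def compute_kv(tokens):
    # Stand-in for the attention KV computation over new tokens.
    return [f"kv({t})" for t in tokens]

def serve_turn(conv_id, full_context):
    """Reuse the cached KV prefix for this conversation and compute KV
    only for tokens appended since the last turn."""
    cached_tokens, cached_kv = kv_store.get(conv_id, ([], []))
    new_tokens = full_context[len(cached_tokens):]
    kv = cached_kv + compute_kv(new_tokens)
    kv_store[conv_id] = (list(full_context), kv)
    return len(new_tokens)  # number of tokens actually recomputed

print(serve_turn("c1", ["hi"]))           # → 1 (cold start)
print(serve_turn("c1", ["hi", "there"]))  # → 1 (prefix KV reused)
```

Without the cache, the second turn would recompute KV for the entire two-token history; with it, only the newly appended token is processed, which is where the savings grow over long conversations.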