Core Concepts
A speculative shortest-job-first (SSJF) scheduler that uses a lightweight proxy model to predict LLM output sequence lengths can reduce average job completion times by 30.5–39.6% and increase throughput by 2.2–3.6× compared to first-come-first-serve schedulers.
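The scheduling idea is simple enough to sketch. Below is a minimal, hedged illustration in Python: a priority queue ordered by the proxy model's predicted output length, with arrival order as the tie-breaker. Here `predict_output_len` is a hypothetical stand-in for the proxy model, not the paper's actual interface.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Callable

@dataclass(order=True)
class Job:
    predicted_len: int              # proxy model's predicted output length (priority key)
    seq: int                        # arrival order, breaks ties FCFS-style
    prompt: str = field(compare=False)

class SpeculativeSJFQueue:
    """Serve the request with the shortest *predicted* output first."""

    def __init__(self, predict_output_len: Callable[[str], int]):
        self.predict = predict_output_len
        self.heap: list[Job] = []
        self.counter = itertools.count()

    def submit(self, prompt: str) -> None:
        # Speculate on the output length before the LLM ever runs.
        job = Job(self.predict(prompt), next(self.counter), prompt)
        heapq.heappush(self.heap, job)

    def next_job(self) -> Job:
        # Shortest predicted job first, which mitigates head-of-line blocking.
        return heapq.heappop(self.heap)

# Usage with a trivial stand-in predictor (prompt length as a crude proxy):
q = SpeculativeSJFQueue(predict_output_len=lambda p: len(p))
q.submit("write a detailed essay about the history of operating systems")
q.submit("what is 2+2?")
assert q.next_job().prompt == "what is 2+2?"
```

Breaking ties by arrival order keeps equally-ranked jobs in FCFS order, so speculation only reorders requests when the predictor actually distinguishes them.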
Abstract
The paper presents SSJF, a speculative request scheduler that addresses the non-deterministic execution times of generative large language models (LLMs) to enable efficient interactive LLM serving.
Key highlights:
LLMs have unpredictable execution times due to their autoregressive nature, posing challenges for efficient serving in interactive AI applications.
Existing LLM serving systems use first-come-first-serve (FCFS) scheduling, which suffers from head-of-line blocking: short requests get stuck waiting behind long-running ones.
SSJF uses a lightweight proxy model (a fine-tuned BERT-base) to predict each request's output sequence length and schedules requests with shorter predicted outputs first (see the predictor sketch after this list).
SSJF supports various batching modes (no batching, dynamic batching, continuous batching) without requiring changes to memory management or batching strategies.
Evaluations on real-world datasets and production workload traces show that SSJF reduces average job completion times by 30.5–39.6% and increases throughput by 2.2–3.6× compared to FCFS schedulers.
The proxy model introduces negligible overhead (7.6 ms on average) compared to the LLM execution time (9.8 s on average).
The paper also discusses potential use cases of proxy models in LLM serving beyond request scheduling, such as memory management and caching.
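As a rough sketch of how such a BERT-base length predictor could be wired up with Hugging Face transformers (the checkpoint name, the single-output regression head, and the 512-token truncation are illustrative assumptions, not the paper's exact configuration):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative assumptions: checkpoint and single-output regression head.
# The paper fine-tunes BERT-base to predict output lengths; its exact head
# and training setup may differ.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1  # regression: predicted output token count
)
model.eval()

@torch.no_grad()
def predict_output_len(prompt: str) -> int:
    """Millisecond-scale length prediction, run before the real LLM call."""
    inputs = tokenizer(prompt, truncation=True, max_length=512, return_tensors="pt")
    pred = model(**inputs).logits.squeeze().item()
    return max(1, round(pred))
```

An untrained head like this would need fine-tuning on (prompt, observed output length) pairs before its predictions mean anything; once trained, `predict_output_len` plugs directly into the queue sketched earlier.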
Stats
The average LLM execution time is 9.8 s.
The average proxy model inference latency is 7.6 ms, or 0.02% of the total end-to-end request latency.
The maximum proxy model inference latency is 20.2 ms, which is less than the minimum LLM execution time of 120 ms.