Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction
A speculative shortest-job-first (SSJF) scheduler that uses a lightweight proxy model to predict LLM output sequence lengths can reduce average job completion times by 30.5–39.6% and increase throughput by 2.2–3.6× compared to first-come-first-serve schedulers.
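The scheduling idea can be illustrated with a minimal sketch. It assumes a single non-preemptive server, all jobs arriving at time zero, and one time unit per generated token. The `Job` class, the `predicted_len` values (standing in for a proxy model's output), and the workload are hypothetical and chosen only for illustration. They do not reproduce the system or the results summarized above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Job:
    job_id: int         # arrival order (lower = earlier)
    true_len: int       # actual output tokens, unknown to the scheduler
    predicted_len: int  # hypothetical proxy-model estimate of true_len


def average_jct(ordered_jobs: List[Job]) -> float:
    """Run jobs back to back in the given order; return mean completion time."""
    clock = 0
    total = 0
    for job in ordered_jobs:
        clock += job.true_len  # assumption: one time unit per output token
        total += clock         # this job finishes at the current clock
    return total / len(ordered_jobs)


def fcfs_order(jobs: List[Job]) -> List[Job]:
    """First-come-first-serve: run in arrival order."""
    return sorted(jobs, key=lambda j: j.job_id)


def ssjf_order(jobs: List[Job]) -> List[Job]:
    """Speculative SJF: run the job with the shortest *predicted* length first."""
    return sorted(jobs, key=lambda j: j.predicted_len)


# Toy workload: predictions are imperfect but preserve the relative order.
JOBS = [
    Job(0, true_len=400, predicted_len=350),
    Job(1, true_len=20, predicted_len=30),
    Job(2, true_len=100, predicted_len=120),
    Job(3, true_len=50, predicted_len=40),
]

if __name__ == "__main__":
    print("FCFS average JCT:", average_jct(fcfs_order(JOBS)))
    print("SSJF average JCT:", average_jct(ssjf_order(JOBS)))
```

In this toy workload the long job arrives first, so under FCFS every short job waits behind it. Ordering by predicted length moves the short jobs ahead. Note that the predictions only need to rank jobs correctly, not match the true lengths exactly, for the ordering to help.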