Symphony introduces deferred batch scheduling to improve the efficiency and throughput of DNN model serving. Traditional model serving systems dispatch requests eagerly, so batches are sent before they can grow to an efficient size, which lowers both efficiency and throughput. Symphony instead defers dispatch to accumulate more requests per batch, which increases batch size and lets the system consolidate work onto a number of GPUs proportional to the load. The system is built around a centralized scheduler that dynamically schedules batches of inference tasks across multiple GPUs, achieving load-proportional GPU usage and efficient resource allocation. Compared with existing systems, Symphony improves goodput by up to 5x with the same number of GPUs and reduces GPU usage by up to 60% for the same workload.
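The idea of deferring dispatch can be illustrated with a toy model. The sketch below is not Symphony's scheduler. It assumes a linear batch latency model (`alpha * batch_size + beta`), a single latency SLO shared by all requests, and a free GPU whenever a batch is ready. The names `batch_latency`, `deferred_batches`, `alpha`, `beta`, and `slo` are hypothetical and chosen only for illustration. Each batch is held until the latest start time that still lets its oldest request meet the deadline, and it absorbs every request that arrives before that moment.

```python
from typing import List, Tuple


def batch_latency(batch_size: int, alpha: float, beta: float) -> float:
    """Assumed linear latency model: per-request cost alpha plus fixed cost beta."""
    return alpha * batch_size + beta


def deferred_batches(
    arrivals: List[float], slo: float, alpha: float, beta: float
) -> List[Tuple[float, int]]:
    """Group sorted arrival times into batches using deferred dispatch.

    Returns a list of (dispatch_time, batch_size) pairs. Assumes a GPU is
    free whenever a batch is dispatched (illustrative simplification).
    """
    if batch_latency(1, alpha, beta) > slo:
        raise ValueError("SLO is shorter than the latency of a single request")

    batches: List[Tuple[float, int]] = []
    i = 0
    while i < len(arrivals):
        # The oldest request in the batch has the earliest deadline.
        deadline = arrivals[i] + slo
        size = 1
        start = deadline - batch_latency(size, alpha, beta)
        # Grow the batch while the next request arrives no later than the
        # latest start time that the larger batch could still afford.
        while i + size < len(arrivals):
            next_start = deadline - batch_latency(size + 1, alpha, beta)
            if arrivals[i + size] <= next_start:
                size += 1
                start = next_start
            else:
                break
        batches.append((start, size))
        i += size
    return batches


# One request per millisecond, 10 ms SLO, latency(b) = 1.0 * b + 2.0 ms.
print(deferred_batches([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0], 10.0, 1.0, 2.0))
# [(4.0, 4), (8.0, 4)]
```

Under the same free-GPU assumption, eager dispatch would send each of these eight requests as its own batch of size 1, whereas deferring yields two batches of size 4. This is the kind of batch-size gain the summary attributes to deferred scheduling, shown here only for this simplified model.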
Key insights distilled from
by Lequn Chen, W... at arxiv.org, 03-01-2024
https://arxiv.org/pdf/2308.07470.pdf