ALTO is a system for optimizing the serving of compound AI systems, particularly pipelines built around generative language models. It streams partial outputs between distributed pipeline stages as they are produced, rather than waiting for each stage to finish. Streaming across distributed stages raises challenges of correctness and load balancing, which ALTO addresses with aggregation-aware routing and distributed prompt-aware scheduling. Experimental results on a complex chatbot verification pipeline show significant improvements in throughput and latency.
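To make the correctness concern concrete, the following is a minimal Python sketch, not ALTO's actual interface. It assumes one plausible reading of aggregation-aware routing: when a downstream stage aggregates over the partial outputs of a request, every partial output of that request must reach the same stage instance. The names `generate`, `route`, and `run_pipeline` are hypothetical, and the word-splitting generator is only a stand-in for a language model that streams its output.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Chunk = Tuple[str, str]  # (request_id, partial_text)


def generate(request_id: str, text: str) -> Iterator[Chunk]:
    """Hypothetical stand-in for a model that streams its output word by word."""
    for word in text.split():
        yield (request_id, word)


def route(request_id: str, num_instances: int) -> int:
    """Choose a downstream instance from the request ID alone (an assumption),
    so every chunk of one request lands on the same instance."""
    return zlib.crc32(request_id.encode("utf-8")) % num_instances


def run_pipeline(streams: Iterable[Iterator[Chunk]],
                 num_instances: int) -> List[Dict[str, int]]:
    """Interleave chunks from several requests and forward each chunk as soon
    as it is produced. Each downstream instance keeps a per-request word
    count, which is a simple aggregation over partial outputs."""
    instances: List[Dict[str, int]] = [defaultdict(int) for _ in range(num_instances)]
    active = list(streams)
    while active:
        still_active = []
        for stream in active:
            chunk = next(stream, None)
            if chunk is None:
                continue  # this request has finished streaming
            request_id, _word = chunk
            instances[route(request_id, num_instances)][request_id] += 1
            still_active.append(stream)
        active = still_active
    return instances


if __name__ == "__main__":
    result = run_pipeline(
        [generate("r1", "a b c"), generate("r2", "d e")],
        num_instances=2,
    )
    for index, counts in enumerate(result):
        print(index, dict(counts))
```

Routing on the request ID keeps each per-request count on a single instance. Spreading chunks across instances without regard to the request would split the count and make the aggregate wrong, which is the kind of correctness problem the summary refers to. This sketch does not model load balancing or prompt-aware scheduling.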
Key insights distilled from
by Keshav Santh... at arxiv.org, 03-08-2024
https://arxiv.org/pdf/2403.04311.pdf