Core Concepts
ALTO optimizes compound AI systems by streaming partial outputs between pipeline stages, which improves throughput and reduces latency.
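The core idea can be sketched with a toy two-stage pipeline: instead of waiting for a full generation, the downstream stage consumes tokens as they are produced. This is a minimal illustrative sketch, assuming a generator-based hand-off; the stage names and token source are hypothetical, not ALTO's actual API.

```python
# Toy sketch of streaming partial outputs between two pipeline stages.
# generate_tokens stands in for a language model emitting tokens
# incrementally; downstream_stage starts work on the first token
# rather than waiting for the whole output.

def generate_tokens(prompt):
    # Placeholder LLM: yields one "token" (word) at a time.
    for token in prompt.split():
        yield token

def downstream_stage(token_stream):
    # Processes each token as it arrives (here: trivial per-token work),
    # overlapping its computation with upstream generation.
    processed = []
    for token in token_stream:
        processed.append(token.upper())
    return " ".join(processed)

result = downstream_stage(generate_tokens("stream partial outputs early"))
```

Because the stages overlap in time, end-to-end latency shrinks and each stage stays busier, which is the throughput/latency win the notes describe.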
Abstract
ALTO is a network orchestrator for compound AI systems, focusing on generative language models.
It streams intermediate outputs to enhance throughput and reduce latency.
Streaming data across distributed pipeline stages raises challenges of correctness and load balancing.
ALTO addresses these challenges with aggregation-aware routing and distributed prompt-aware scheduling.
Experimental results show significant performance improvements in a chatbot verification pipeline.
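Aggregation-aware routing can be illustrated with a small sketch: fragments that must later be aggregated per request are routed by a key on the request id, so every fragment of one request lands on the same downstream worker. This is a hypothetical sketch of the general technique (consistent key-based routing), not ALTO's implementation; the worker count and fragment format are assumptions.

```python
# Sketch of aggregation-aware routing: hash the request id so that all
# partial outputs belonging to one request reach the same worker, which
# can then aggregate them correctly.
import hashlib

NUM_WORKERS = 4  # assumed pool size for illustration

def route(request_id: str) -> int:
    # Deterministic key: fragments of the same request share a worker.
    digest = hashlib.sha256(request_id.encode()).digest()
    return digest[0] % NUM_WORKERS

workers = {i: [] for i in range(NUM_WORKERS)}
fragments = [("req-1", "part-a"), ("req-2", "part-b"), ("req-1", "part-c")]
for request_id, fragment in fragments:
    workers[route(request_id)].append((request_id, fragment))
```

Plain round-robin would spread a request's fragments across workers and break per-request aggregation; key-based routing trades some load-balancing freedom for correctness, which is the tension the notes attribute to ALTO's aggregation-aware routing and prompt-aware scheduling.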
Stats
"increasing throughput by up to 3× for a fixed latency target of 4 seconds / request"
"reducing tail latency by 1.8× compared to a baseline serving approach"
Quotes
"ALTO achieves high throughput and low latency by taking advantage of an optimization opportunity specific to generative language models."
"Streaming partial outputs between distributed stages can reduce serving latency and increase throughput."