The authors argue that current LLM serving systems face a tradeoff between throughput and latency, and propose Sarathi-Serve as a scheduling approach that improves both metrics simultaneously.
ALISA proposes an algorithm-system co-design to accelerate Large Language Model (LLM) inference by addressing the memory pressure imposed by KV caching, achieving significant throughput improvements.
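Neither paper's exact mechanism is reproduced here; as shared background for both summaries, the sketch below illustrates plain KV caching during autoregressive decoding, where the cached keys and values grow linearly with sequence length. That growth is the memory and throughput pressure these systems target. The class and function names are illustrative and not taken from either paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class KVCache:
    """Stores key/value vectors of previously decoded tokens so each
    new decode step only computes attention for a single query."""
    def __init__(self, d_model):
        self.keys = np.empty((0, d_model))    # (seq_len, d_model)
        self.values = np.empty((0, d_model))  # (seq_len, d_model)

    def append(self, k, v):
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])

def decode_step(q, k, v, cache):
    """Single-token attention over the cache; cache size grows with sequence length."""
    cache.append(k, v)
    scores = cache.keys @ q / np.sqrt(q.shape[-1])  # (seq_len,)
    weights = softmax(scores)
    return weights @ cache.values                   # (d_model,)

# Toy usage: three decode steps with random vectors standing in for a real model.
rng = np.random.default_rng(0)
d = 8
cache = KVCache(d)
for _ in range(3):
    q, k, v = rng.normal(size=(3, d))
    out = decode_step(q, k, v, cache)
print(cache.keys.shape)  # (3, 8): cache footprint grows linearly with decoded tokens
```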