Optimizing Throughput for Constraint-Aware Execution of Large Language Models
ExeGPT is a distributed system that finds and runs an optimal execution schedule, maximizing inference throughput while satisfying a given latency constraint. It leverages the distribution of input and output sequences to allocate resources effectively and to determine optimal execution configurations.
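To make the optimization concrete, the sketch below shows one way a latency-constrained throughput search could look: enumerate candidate configurations, estimate latency and throughput for each, and keep the feasible one with the highest throughput. It is an illustration under stated assumptions, not ExeGPT's actual scheduler. The `Config` knobs (`batch_size`, `tensor_parallel`), the cost model in `estimate`, its constants, and the use of a mean output length taken from a discrete length distribution are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Config:
    batch_size: int       # hypothetical knob: sequences decoded together
    tensor_parallel: int  # hypothetical knob: GPUs sharing each layer


def expected_length(dist: Dict[int, float]) -> float:
    """Mean of a discrete length distribution given as {length: probability}."""
    return sum(length * prob for length, prob in dist.items())


def estimate(cfg: Config, mean_out_len: float) -> Tuple[float, float]:
    """Toy cost model (an assumption for illustration only).

    Returns (latency in seconds, throughput in sequences per second).
    """
    # Assumed: a decode step slows down with batch size and speeds up
    # linearly with the tensor-parallel degree. Real systems do not scale
    # this cleanly; the constants are made up.
    step_time = (0.010 + 0.0005 * cfg.batch_size) / cfg.tensor_parallel
    latency = step_time * mean_out_len
    throughput = cfg.batch_size / latency
    return latency, throughput


def best_config(
    candidates: Iterable[Config],
    out_dist: Dict[int, float],
    latency_bound: float,
) -> Optional[Config]:
    """Return the feasible candidate with the highest estimated throughput.

    Returns None when no candidate meets the latency bound.
    """
    mean_out = expected_length(out_dist)
    best, best_throughput = None, float("-inf")
    for cfg in candidates:
        latency, throughput = estimate(cfg, mean_out)
        if latency <= latency_bound and throughput > best_throughput:
            best, best_throughput = cfg, throughput
    return best


if __name__ == "__main__":
    # Assumed workload: half the outputs are 64 tokens, half are 128.
    out_dist = {64: 0.5, 128: 0.5}
    candidates = [Config(b, t) for b in (1, 8, 32) for t in (1, 2)]
    print(best_config(candidates, out_dist, latency_bound=1.5))
```

Tightening `latency_bound` pushes the search toward smaller batches, which is the throughput-versus-latency trade-off the summary describes. An exhaustive scan is used here only because the toy search space is tiny.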