Core Concepts
Micro-batching, which groups requests that arrive close together in time, can significantly improve both the efficiency and the quality of load balancing, because the balancer makes each placement decision with more information.
Abstract
The article discusses the benefits of using micro-batching techniques to enhance load balancing algorithms, particularly the "power of two random choices" algorithm.
The author explains that under high traffic, workload placement requests can be grouped into small micro-batches by waiting a few milliseconds before acting on them. Batching of this kind already makes processing more efficient elsewhere: in GPU inference, for example, batching requests improves utilization and response times.
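The grouping step can be sketched as a small collector that waits for the first request and then gathers whatever else arrives within a short window. This is a minimal illustration, not the article's implementation: the function name `collect_micro_batch`, the 5 ms default window, and the `max_size` cap are all assumptions made for the example.

```python
import queue
import time


def collect_micro_batch(requests, window_s=0.005, max_size=64):
    """Gather requests that arrive within window_s of the first one.

    `requests` is a queue.Queue of pending placement requests.
    window_s and max_size are illustrative defaults, not values from the article.
    """
    batch = [requests.get()]  # block until at least one request exists
    deadline = time.monotonic() + window_s
    while len(batch) < max_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # the batching window has closed
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived before the deadline
    return batch
```

The window bounds the extra latency any single request pays for being batched, which is why a few milliseconds is a reasonable order of magnitude under high traffic.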
The author then applies this concept to load balancing: instead of drawing random samples independently for each workload, the balancer draws samples once for a whole micro-batch of workloads. It can then sort the available workers by their remaining capacity and the workloads by their requirements, which leads to better placement decisions.
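A rough sketch of the contrast between per-workload sampling and per-batch sampling follows. The summary does not say how the sorted workers and workloads are paired, so the rule used here (largest workload goes to the sampled worker with the most free capacity) is an assumption, as are the function names and the choice of two sampled workers per workload. The sketch also skips capacity checks, so a worker's free capacity can go negative.

```python
import random


def place_two_choices(free, size, rng):
    """Baseline: sample two workers, place on the one with more free capacity."""
    a, b = rng.sample(range(len(free)), 2)
    best = a if free[a] >= free[b] else b
    free[best] -= size
    return best


def place_micro_batch(free, sizes, rng, samples_per_workload=2):
    """Sample workers once for the whole batch, then match sorted lists.

    Assumed pairing rule: the largest workload goes to the sampled worker
    with the most free capacity, the second largest to the next, and so on.
    Returns a dict mapping workload index to worker index.
    """
    if len(sizes) > len(free):
        raise ValueError("batch is larger than the fleet")
    k = min(len(free), samples_per_workload * len(sizes))
    candidates = rng.sample(range(len(free)), k)
    # Workers with the most remaining capacity first.
    candidates.sort(key=lambda w: free[w], reverse=True)
    # Workloads with the largest requirement first.
    order = sorted(range(len(sizes)), key=lambda i: sizes[i], reverse=True)
    placement = {}
    for i, w in zip(order, candidates):
        free[w] -= sizes[i]
        placement[i] = w
    return placement
```

Both functions mutate `free` in place, so the same list can be carried across successive calls to simulate a stream of placements.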
The article presents a simulation with a fleet of 8,000 workers and 120,000 workloads to compare random placement with micro-batched placement. The results show that micro-batching can significantly improve overall placement quality, even with a small batch size.
Stats
The fleet has 8,000 workers, each with 250GB of memory.
The workloads have memory requirements drawn from a gamma distribution (shape k = 5, scale theta = 3), bounded between 1 and 50 GB.
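The workload sizes described above could be generated along these lines. The summary does not say how the bounds are enforced, so clamping out-of-range draws to the limits is an assumption (rejection sampling would be another option), and the seed and function name are illustrative. Sizes are in the same unit as the stated bounds.

```python
import random

WORKLOADS = 120_000  # number of workloads in the simulation


def sample_workload_sizes(n, rng, k=5.0, theta=3.0, lo=1.0, hi=50.0):
    """Draw n sizes from Gamma(shape=k, scale=theta), clamped to [lo, hi].

    Clamping is an assumption; the source only says the values lie
    between the two bounds.
    """
    return [min(hi, max(lo, rng.gammavariate(k, theta))) for _ in range(n)]


rng = random.Random(42)  # fixed seed so runs are repeatable
sizes = sample_workload_sizes(WORKLOADS, rng)
print(f"min={min(sizes):.1f} max={max(sizes):.1f} "
      f"mean={sum(sizes) / len(sizes):.1f}")
```

An unbounded gamma distribution with these parameters has mean k * theta = 15, so the sample mean should land close to that value.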