Key Concepts
Effective load balancing strategies are crucial for ensuring efficient and stable data processing in Apache Kafka-based applications, especially when dealing with heterogeneous hardware and uneven workloads.
Summary
The article discusses the challenges of load balancing in Apache Kafka and presents strategies to address them. It starts by explaining the importance of Kafka in Agoda's architecture, where it is used to handle the massive flow of data across various supply systems.
The key challenges identified are:
- Heterogeneous hardware: Agoda's private cloud deployment leads to servers with varying processing capabilities, affecting the overall performance.
- Uneven workload for each Kafka message: Different messages may require different processing steps, resulting in varying processing rates.
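The cost of the first challenge can be worked through with the numbers the article quotes (50 messages per second, five machines, 66.7% overprovisioning). The sketch below is illustrative only: the per-machine capacities of 20, 20, and 10 messages per second are an assumption chosen to match those figures, and the function names are hypothetical. It also assumes any added machine is at least as fast as the slowest existing one.

```python
import math

def machines_needed_round_robin(target_rate, slowest_capacity):
    """Round-robin hands every consumer target_rate / n messages per second,
    so the slowest consumer decides how many machines are required."""
    return math.ceil(target_rate / slowest_capacity)

def overprovisioning_pct(actual, ideal):
    """Extra machines as a percentage of the ideal machine count."""
    return (actual - ideal) / ideal * 100

# ASSUMPTION: capacities in messages/second, picked to match the quoted figures.
capacities = [20, 20, 10]
target_rate = 50  # messages/second, from the article's example

# With capacity-aware distribution the three machines together cover the target.
ideal = len(capacities)
assert sum(capacities) >= target_rate

# With round-robin, each machine's equal share must fit the slowest machine.
actual = machines_needed_round_robin(target_rate, min(capacities))
extra = actual - ideal
pct = round(overprovisioning_pct(actual, ideal), 1)

print(f"round-robin needs {actual} machines, {extra} more than ideal ({pct}%)")
```

Under these assumptions the equal share per machine (50 / n) only drops to the slowest machine's 10 messages per second once n reaches five, which is where the two extra machines come from.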
The article then explores several solutions to address these challenges:
- Static Balancing Solutions:
  - Deployment on Identical Pods: Ensuring all pods have the same hardware configuration, but this may not be feasible in a private cloud environment.
  - Weighted Load Balancing: Assigning varying weights to different consumers based on their estimated capacities, but this approach may not be practical due to the dynamic nature of the system.
- Lag-Aware Strategies:
  - Lag-Aware Producers: Producers that consider the lag information of the target topic to dynamically adjust the message distribution, suitable for use cases with a dedicated producer.
  - Lag-Aware Consumers: Consumers that monitor the current lags and trigger a rebalance to redistribute the load, useful when there are multiple consumer groups.
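As a rough illustration of the lag-aware producer idea, the sketch below turns a snapshot of per-partition lag into selection weights, so partitions that are further behind receive fewer new messages. This is not Agoda's implementation: the function names, the inverse-lag formula, the `smoothing` constant, and the lag values are all assumptions made for the example, and no Kafka client API is used. In a real producer the lag snapshot would come from consumer group offsets and the chosen partition would be passed to the client when sending.

```python
import random

def lag_weights(lags, smoothing=100.0):
    """Map per-partition lag to normalized selection weights.

    Less lag means more weight. `smoothing` (an assumed tuning constant)
    keeps a partition with zero lag from taking all of the traffic.
    """
    raw = {p: 1.0 / (lag + smoothing) for p, lag in lags.items()}
    total = sum(raw.values())
    return {p: w / total for p, w in raw.items()}

def pick_partition(weights, rng=random):
    """Choose a partition at random, in proportion to its weight."""
    partitions = list(weights)
    return rng.choices(partitions, weights=[weights[p] for p in partitions], k=1)[0]

# Hypothetical lag snapshot: partition id -> number of unconsumed messages.
lags = {0: 0, 1: 900, 2: 100}
weights = lag_weights(lags)

# Simulate routing 10,000 messages with a seeded generator for repeatability.
rng = random.Random(42)
counts = {p: 0 for p in lags}
for _ in range(10_000):
    counts[pick_partition(weights, rng)] += 1

print(weights)
print(counts)
```

The same weighting function could be fed static capacity estimates instead of lag, which corresponds to the weighted load balancing option; the lag-aware variant differs only in that the weights are recomputed as lag changes.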
The article also mentions the concept of cluster-level load balancing, where the Kafka cluster itself can be configured to distribute the load based on the nature of the messages.
Implementing these strategies reduced the resources allocated to the supply system by 50% while keeping the service-level agreement (SLA) stable, with no need for manual mitigation.
Statistics
The article provides the following key metrics:
- A single supplier can provide 1.5M price updates and offer details in just one minute.
- A benchmark of processing capacity across different hardware generations shows significant differences in performance.
Quotes
"Any delays or failures in reflecting these updates can lead to incorrect pricing and customer booking failures."
"While round-robin ensures a perfectly even distribution of messages, it does not guarantee a perfect distribution in terms of performance in this case."
"To process 50 messages per second, we would need to scale up to five machines to ensure the timely processing of all messages. This results in overprovisioning two additional machines to this system because of this inappropriate distribution logic (66.7% overprovisioning)."