Core Concepts
Reinforcement learning can learn backfilling strategies that outperform traditional heuristic-based approaches by directly optimizing a scheduling performance metric such as average bounded job slowdown.
Abstract
This paper proposes RLBackfilling, a reinforcement learning-based approach to improve backfilling strategies for scheduling High-Performance Computing (HPC) batch jobs.
The key insights are:
Accurate job runtime prediction does not necessarily lead to better scheduling performance, as there is a trade-off between prediction accuracy and backfilling opportunities.
Reinforcement learning can be used to directly learn effective backfilling strategies, without relying on explicit job runtime predictions.
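The first insight can be illustrated with the standard EASY backfilling rule: a waiting job may jump ahead only if it does not delay the reservation made for the job at the head of the queue. The sketch below is a toy illustration, not the paper's code. The function names, the 10-processor machine, and all runtimes are made-up assumptions chosen to show one way a more accurate runtime estimate can shrink the backfilling window.

```python
from typing import List, Tuple

# Each running job is (estimated end time in seconds, processors held).
Running = List[Tuple[float, int]]


def shadow_time_and_extra(running: Running, free_procs: int,
                          head_procs: int, now: float = 0.0) -> Tuple[float, int]:
    """Return the head job's reserved start time and the processors left over then."""
    avail = free_procs
    if avail >= head_procs:
        return now, avail - head_procs
    # Release running jobs in order of their estimated end times.
    for est_end, procs in sorted(running):
        avail += procs
        if avail >= head_procs:
            return est_end, avail - head_procs
    raise ValueError("head job needs more processors than the machine has")


def can_backfill(cand_procs: int, cand_est_runtime: float, running: Running,
                 free_procs: int, head_procs: int, now: float = 0.0) -> bool:
    """EASY rule: the candidate must fit now and must not delay the head job."""
    if cand_procs > free_procs:
        return False
    shadow, extra = shadow_time_and_extra(running, free_procs, head_procs, now)
    finishes_in_time = now + cand_est_runtime <= shadow
    fits_in_leftover = cand_procs <= extra
    return finishes_in_time or fits_in_leftover


# Hypothetical scenario: 10 processors, 6 held by one running job, 4 free.
# The head job needs 8; the candidate needs 4 for an estimated 600 s.
# The running job requested 1000 s but would actually finish after 400 s.
with_user_request = can_backfill(4, 600.0, [(1000.0, 6)], 4, 8)
with_accurate_estimate = can_backfill(4, 600.0, [(400.0, 6)], 4, 8)
```

With the over-estimated request the reservation sits at 1000 s and the candidate is allowed to backfill. With the accurate 400 s estimate the reservation moves earlier and the same candidate is rejected. This is only one scenario under assumed numbers; it does not show which outcome leads to a better overall schedule.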
The RLBackfilling approach works as follows:
It formulates backfilling as a reinforcement learning problem: the observation is the current state of the job queue and resource availability, and the action is the selection of jobs to backfill.
It uses a deep neural network-based agent to learn a backfilling policy through trial and error on historical job traces, with the goal of minimizing the average bounded job slowdown.
The trained RLBackfilling agent can then make backfilling decisions during actual job scheduling, and it works alongside different base scheduling policies such as First-Come-First-Served (FCFS) and Shortest-Job-First (SJF).
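The formulation above can be sketched as a small observation/action interface. This is a minimal sketch under stated assumptions, not the paper's implementation: the Job fields, the MAX_QUEUE window, the feature layout, the masking of jobs that do not fit, and the stand-in policy are all hypothetical. In a real setup the policy would be the trained neural network, and the reward would be derived from the bounded slowdown of the jobs in an episode.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Job:
    job_id: int
    wait: float            # seconds spent in the queue so far
    requested_time: float  # user-requested runtime in seconds
    procs: int             # processors requested


MAX_QUEUE = 4  # hypothetical fixed window of queued jobs shown to the agent

Policy = Callable[[List[List[float]]], Sequence[float]]


def build_observation(queue: List[Job], free_procs: int,
                      total_procs: int) -> List[List[float]]:
    """Encode queue state and resource availability as fixed-size rows."""
    free_frac = free_procs / total_procs
    obs = [[j.wait, j.requested_time, j.procs / total_procs, free_frac]
           for j in queue[:MAX_QUEUE]]
    # Pad so the network input keeps the same shape for short queues.
    obs += [[0.0, 0.0, 0.0, 0.0]] * (MAX_QUEUE - len(obs))
    return obs


def action_mask(queue: List[Job], free_procs: int) -> List[bool]:
    """Only jobs that fit in the currently free processors are selectable."""
    mask = [j.procs <= free_procs for j in queue[:MAX_QUEUE]]
    return mask + [False] * (MAX_QUEUE - len(mask))


def choose_backfill(queue: List[Job], free_procs: int, total_procs: int,
                    policy: Policy) -> Optional[Job]:
    """Action step: return the highest-scoring selectable job, or None."""
    scores = policy(build_observation(queue, free_procs, total_procs))
    mask = action_mask(queue, free_procs)
    candidates = [i for i, ok in enumerate(mask) if ok]
    if not candidates:
        return None
    return queue[max(candidates, key=lambda i: scores[i])]


def episode_reward(bounded_slowdowns: Sequence[float]) -> float:
    """Negative average bounded slowdown, so a higher reward is better."""
    return -sum(bounded_slowdowns) / len(bounded_slowdowns)


# Stand-in for a trained network: prefer the shortest requested runtime.
def shortest_first(obs: List[List[float]]) -> List[float]:
    return [-row[1] for row in obs]


queue = [Job(10, 300.0, 50.0, 8), Job(11, 120.0, 500.0, 2), Job(12, 60.0, 100.0, 4)]
chosen = choose_backfill(queue, free_procs=4, total_procs=10, policy=shortest_first)
```

Here job 10 is masked out because it needs more processors than are free, and the stand-in policy picks job 12 over job 11 because its requested runtime is shorter. Swapping shortest_first for a learned scoring function leaves the rest of the interface unchanged.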
The evaluation results show that RLBackfilling outperforms traditional EASY backfilling by up to 59% in terms of average bounded job slowdown, and outperforms EASY backfilling with ideal runtime predictions by up to 30%. In addition, an RLBackfilling agent trained on one job trace generalizes well to other, unseen traces.
Stats
The average bounded job slowdown (bsld) metric is used to measure scheduling performance.
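The summary does not spell out the metric, so the sketch below uses the definition commonly found in the HPC scheduling literature: a job's bounded slowdown is max((wait + run) / max(run, tau), 1), where tau is a small threshold that keeps very short jobs from dominating the average. The 10-second default for tau and the function names are assumptions here, not values taken from the summary.

```python
from typing import Iterable, Tuple


def bounded_slowdown(wait: float, run: float, tau: float = 10.0) -> float:
    """Bounded slowdown of one job; tau (seconds) is an assumed threshold."""
    return max((wait + run) / max(run, tau), 1.0)


def average_bounded_slowdown(jobs: Iterable[Tuple[float, float]],
                             tau: float = 10.0) -> float:
    """Mean bounded slowdown over (wait, run) pairs, both in seconds."""
    jobs = list(jobs)
    if not jobs:
        raise ValueError("need at least one job")
    return sum(bounded_slowdown(w, r, tau) for w, r in jobs) / len(jobs)


# A job that waits as long as it runs has a bounded slowdown of 2.
example = average_bounded_slowdown([(0.0, 100.0), (100.0, 100.0)])
```

Lower values are better, and a value of 1 means the job started without waiting. The example averages a job with no wait (1.0) and a job that waited as long as it ran (2.0), giving 1.5.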
Quotes
"There is a missing trade-off between prediction accuracy and backfilling opportunities."
"Higher runtime prediction accuracy does not necessarily lead to better scheduling performance, and pursuing better predictive models alone may not be sufficient to improve scheduling."