# Efficient Multi-Processor Scheduling of Computational DAGs

Core Concepts

Scheduling computational DAGs on multiple processors can be significantly improved by considering realistic system parameters such as communication costs and NUMA effects, using advanced algorithms like local search and ILP-based methods.

Abstract

The paper studies the problem of efficiently scheduling a computational DAG on multiple processors. In contrast to previous works that have focused on relatively simple models, the authors analyze this problem in a more realistic model that captures many real-world aspects, such as communication costs, synchronization costs, and the hierarchical structure of modern processing architectures.
The authors extend the well-established BSP model of parallel computing with non-uniform memory access (NUMA) effects. They then develop a range of new scheduling algorithms to minimize the scheduling cost in this more complex setting: several initialization heuristics, a hill-climbing local search method, and several approaches that formulate (and solve) the scheduling problem as an Integer Linear Program (ILP).
The authors combine these algorithms into a single framework and conduct experiments on a diverse set of real-world computational DAGs. The results show that the resulting scheduler significantly outperforms both academic and practical baselines. Even without NUMA effects, the scheduler finds solutions with 24%-44% smaller cost on average than the baselines. In case of NUMA effects, it achieves up to a factor 2.5x improvement compared to the baselines. The authors also develop a multilevel scheduling algorithm, which provides up to almost a factor 5x improvement in the special case when the problem is dominated by very high communication costs.

Stats

The computation weight of a node 𝑣 is denoted as 𝑤(𝑣).
The communication weight of a node 𝑣 is denoted as 𝑐(𝑣).
The number of processors is denoted as 𝑃.
The communication cost per unit of data is denoted as 𝑔.
The latency cost per superstep is denoted as ℓ.
The NUMA communication cost multiplier between processors 𝑝1 and 𝑝2 is denoted as 𝜆_{𝑝1,𝑝2}.
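As a rough illustration of how these parameters combine, the classic BSP superstep cost (compute plus 𝑔 times the maximum communication volume plus latency ℓ) can be extended with per-processor-pair NUMA multipliers. This is a sketch under that assumption, not necessarily the paper's exact cost function; all names are illustrative:

```python
from typing import Dict, Tuple

def superstep_cost(
    work: Dict[int, float],              # processor -> total w(v) computed this superstep
    comm: Dict[Tuple[int, int], float],  # (p1, p2) -> units of data sent from p1 to p2
    numa: Dict[Tuple[int, int], float],  # (p1, p2) -> multiplier lambda_{p1,p2}
    g: float,                            # communication cost per unit of data
    latency: float,                      # latency cost per superstep (ell)
) -> float:
    """Classic BSP superstep cost, with communication scaled by NUMA multipliers."""
    compute = max(work.values(), default=0.0)
    # Accumulate NUMA-scaled send/receive volumes per processor.
    sent: Dict[int, float] = {}
    recv: Dict[int, float] = {}
    for (p1, p2), units in comm.items():
        scaled = units * numa.get((p1, p2), 1.0)
        sent[p1] = sent.get(p1, 0.0) + scaled
        recv[p2] = recv.get(p2, 0.0) + scaled
    # h-relation: the maximum volume any single processor sends or receives.
    h = max([*sent.values(), *recv.values(), 0.0])
    return compute + g * h + latency
```

The total schedule cost would then be the sum of this expression over all supersteps.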

Quotes

"We study the problem of efficiently scheduling a computational DAG on multiple processors."
"We then develop a range of new scheduling algorithms to minimize the scheduling cost in this more complex setting: several initialization heuristics, a hill-climbing local search method, and several approaches that formulate (and solve) the scheduling problem as an Integer Linear Program (ILP)."
"Even without NUMA effects, our scheduler finds solutions of 24% −44% smaller cost on average than the baselines, and in case of NUMA effects, it achieves up to a factor 2.5× improvement compared to the baselines."

Deeper Inquiries

To improve the scalability of the proposed scheduling algorithms for extremely large computational DAGs, several strategies can be implemented:
Parallelization: Implement parallel processing techniques to distribute the computational load across multiple cores or machines. By parallelizing the scheduling algorithms, the processing time can be significantly reduced, allowing for faster scheduling of large DAGs.
Optimized Data Structures: Utilize optimized data structures and algorithms to handle the large-scale DAGs more efficiently. This includes using data structures that can handle the size and complexity of the DAGs without compromising performance.
Incremental Scheduling: Implement an incremental scheduling approach where the DAG is divided into smaller sub-DAGs that can be scheduled independently. This incremental approach can help in managing the scheduling of large DAGs more effectively.
Heuristic Refinement: Develop more sophisticated heuristic algorithms that can quickly generate high-quality initial schedules for large DAGs. These heuristics can help in reducing the search space and improving the efficiency of the scheduling algorithms.
Resource-aware Scheduling: Incorporate resource-aware scheduling techniques that take into account the specific characteristics of the underlying hardware architecture, such as NUMA effects, to optimize task allocation and communication.
By incorporating these strategies, the scheduling algorithms can be enhanced to scale better to extremely large computational DAGs, ensuring efficient and effective scheduling of complex computational workloads.
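One concrete way to realize the incremental-scheduling idea above is to cut the DAG into topological level-sets, so that each level forms a batch whose nodes have no mutual dependencies and can be scheduled independently. A minimal sketch (function and variable names are illustrative, not from the paper):

```python
from collections import defaultdict, deque
from typing import Dict, Hashable, Iterable, List, Tuple

def topological_levels(
    nodes: Iterable[Hashable],
    edges: Iterable[Tuple[Hashable, Hashable]],
) -> List[List[Hashable]]:
    """Group DAG nodes by longest-path depth; each returned level is a
    dependency-free batch that can be handed to the scheduler on its own."""
    nodes = list(nodes)
    indeg: Dict[Hashable, int] = {v: 0 for v in nodes}
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    level = {v: 0 for v in nodes}
    queue = deque(v for v in nodes if indeg[v] == 0)
    while queue:  # Kahn's algorithm, tracking depth
        u = queue.popleft()
        for v in succ[u]:
            level[v] = max(level[v], level[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    groups = defaultdict(list)
    for v, depth in level.items():
        groups[depth].append(v)
    return [sorted(groups[d]) for d in sorted(groups)]
```

Note that scheduling levels independently sacrifices cross-level optimization (e.g. placing a node next to its consumer), so in practice adjacent batches would be stitched together with a refinement pass.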

The theoretical complexity bounds and approximation guarantees for ILP-based scheduling approaches can vary depending on the specific formulation of the ILP problem and the characteristics of the input DAGs. Here are some general considerations:
Complexity Bounds: Solving ILPs is NP-hard in general, so no known polynomial-time algorithm solves them optimally. In practice, the running time of ILP solvers depends on the size of the formulation (the number of variables and constraints) and the specific characteristics of the problem instance.
Approximation Guarantees: In practice, ILP solvers often provide near-optimal solutions for scheduling problems. ILP-based scheduling approaches rarely come with formal a-priori approximation guarantees, but branch-and-bound solvers report an optimality gap alongside the solution they return, so the distance to the optimum is certified a posteriori. The quality of the solution depends on the formulation of the ILP, the solver used, and the specific characteristics of the scheduling problem.
Heuristic Solutions: In cases where solving the ILP problem optimally is computationally infeasible, heuristic approaches can be used to find good-quality solutions within a reasonable time frame. These heuristic solutions may not have formal approximation guarantees but can provide effective scheduling solutions in practice.
Overall, ILP-based scheduling approaches offer a powerful framework for optimizing scheduling problems, with the trade-off between computational complexity and solution quality depending on the specific problem instance and algorithm used.
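For concreteness, a typical way such a scheduling ILP is formulated (a generic sketch, not necessarily the paper's exact model) uses binary variables 𝑥_{𝑣,𝑝,𝑠} assigning node 𝑣 to processor 𝑝 in superstep 𝑠, with 𝑊_𝑠 an auxiliary variable bounding the work of superstep 𝑠 and 𝐻_𝑠 the (NUMA-weighted) maximum communication volume, whose linearization requires further variables:

```latex
\begin{align*}
\min\quad & C \\
\text{s.t.}\quad
  & \sum_{p,s} x_{v,p,s} = 1
      && \forall v \in V && \text{(each node scheduled exactly once)}\\
  & x_{v,p,s} \le \sum_{s' < s}\sum_{p'} x_{u,p',s'} + x_{u,p,s}
      && \forall (u,v) \in E,\ \forall p, s && \text{(precedence)}\\
  & W_s \ge \sum_{v \in V} w(v)\, x_{v,p,s}
      && \forall p, s && \text{(per-superstep work bound)}\\
  & C \ge \sum_{s} \bigl( W_s + g\, H_s + \ell \bigr)
      && && \text{(total BSP cost)}\\
  & x_{v,p,s} \in \{0,1\}
\end{align*}
```

The precedence constraint allows a successor to run in the same superstep only on the same processor as its predecessor; cross-processor dependencies must wait for a later superstep, which is where the communication term 𝑔𝐻_𝑠 is incurred.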

The multilevel scheduling framework can be generalized to other parallel computing models beyond BSP by adapting the coarsening, solving, and uncoarsening steps to the specific characteristics of the target model. Here's how it could be extended to other parallel computing models:
Model-specific Coarsening: Modify the coarsening step to capture the hierarchical structure and communication patterns of the target parallel computing model. For example, in a different model with distinct communication costs or synchronization requirements, the coarsening process would need to reflect these differences.
Solving Algorithm: Develop a solving algorithm tailored to the constraints and objectives of the new parallel computing model. This algorithm should optimize task allocation, communication, and synchronization based on the specific characteristics of the model.
Uncoarsening and Refinement: Adjust the uncoarsening and refinement steps to account for the unique features of the target model. This may involve refining the schedule to meet model-specific constraints or performance metrics.
Performance Evaluation: Conduct thorough performance evaluations to assess the effectiveness of the multilevel scheduling framework in the new parallel computing model. Compare the results against existing scheduling approaches to determine the framework's impact on scheduling efficiency and performance.
By generalizing the multilevel scheduling framework to other parallel computing models, it can be adapted to different system architectures and requirements, potentially improving scheduling performance and scalability across a broader range of computational environments.
