Core Concept
Relic, a specialized software-only framework, delivers significant performance improvements over state-of-the-art parallel programming frameworks for fine-grained tasks on simultaneous multithreading (SMT) CPU cores.
Summary
The paper explores the performance of fine-grained task parallelism on simultaneous multithreading (SMT) CPU cores. It first analyzes the performance of seven state-of-the-art shared-memory parallel programming frameworks, including OpenMP, Intel oneAPI Threading Building Blocks (oneTBB), OpenCilk, and Taskflow, using real-world fine-grained application kernels such as graph algorithms and JSON parsing. The results show that the existing frameworks degrade performance on several fine-grained tasks.
To address this, the paper introduces Relic, a specialized parallel programming framework designed to enable extremely fine-grained task parallelism on SMT cores. Relic uses a simple single-producer single-consumer task scheduling mechanism and optimized waiting/suspension mechanisms to reduce task-handling overheads. The evaluation shows that Relic increases the speedup achieved over serial implementations by 19.1% to 33.2%, depending on which of the seven state-of-the-art frameworks it is compared against.
The key insights are:
- Existing general-purpose parallel frameworks struggle to efficiently parallelize fine-grained tasks on SMT cores.
- Relic, a specialized framework, can enable much higher performance for fine-grained parallelism on SMT cores through simplicity and targeted optimizations.
- Software-only techniques can be effective in improving fine-grained task parallelism on SMT cores without requiring hardware changes.
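The single-producer single-consumer scheduling idea mentioned in the summary can be illustrated with a small C++ sketch. This is not Relic's actual code: the names `SpscQueue`, `Task`, `square`, and `run_demo` are hypothetical, and the `std::this_thread::yield()` busy-wait is only a placeholder for the paper's optimized waiting/suspension mechanisms, which this summary does not describe. The point of the sketch is that with exactly one producer and one consumer, a task queue needs only two atomic counters and no locks or work stealing.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>

// Hypothetical fixed-capacity single-producer single-consumer ring buffer.
// Capacity must be a power of two so index wrapping is a cheap bit mask.
template <typename T, std::size_t Capacity>
class SpscQueue {
    static_assert((Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");

public:
    // Producer thread only. Returns false when the queue is full.
    bool try_push(const T& value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail - head_.load(std::memory_order_acquire) == Capacity) {
            return false;
        }
        slots_[tail & (Capacity - 1)] = value;
        // Release: the slot write becomes visible before the new tail.
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

    // Consumer thread only. Returns false when the queue is empty.
    bool try_pop(T& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) {
            return false;
        }
        out = slots_[head & (Capacity - 1)];
        // Release: the slot read completes before the slot is reusable.
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

private:
    std::array<T, Capacity> slots_{};
    // Separate cache lines to avoid false sharing between the two threads.
    alignas(64) std::atomic<std::size_t> head_{0};
    alignas(64) std::atomic<std::size_t> tail_{0};
};

// A tiny task: a function pointer plus one argument.
// A null function pointer is used here as the shutdown signal.
struct Task {
    std::uint64_t (*fn)(std::uint64_t);
    std::uint64_t arg;
};

std::uint64_t square(std::uint64_t x) { return x * x; }

// Submits n fine-grained tasks to one worker thread and returns the
// sum of their results.
std::uint64_t run_demo(std::uint64_t n) {
    SpscQueue<Task, 64> queue;
    std::uint64_t sum = 0;

    std::thread worker([&] {
        Task task{};
        for (;;) {
            if (!queue.try_pop(task)) {
                std::this_thread::yield();  // placeholder waiting strategy
                continue;
            }
            if (task.fn == nullptr) break;  // shutdown signal
            sum += task.fn(task.arg);
        }
    });

    for (std::uint64_t i = 1; i <= n; ++i) {
        while (!queue.try_push(Task{&square, i})) {
            std::this_thread::yield();
        }
    }
    while (!queue.try_push(Task{nullptr, 0})) {
        std::this_thread::yield();
    }
    worker.join();  // join makes the worker's writes to sum visible here

    Task leftover{};
    assert(!queue.try_pop(leftover));  // every task was consumed
    (void)leftover;
    return sum;
}
```

Calling `run_demo(10)` sums the squares of 1 through 10. In a setting like the one the paper targets, the producer and the worker would presumably be pinned to the two hardware threads of one SMT core; thread pinning is platform-specific and is left out of this sketch.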
Statistics
The Relic framework increases the speedup achieved over serial implementations by:
- 19.1% compared to LLVM OpenMP
- 31.0% compared to GNU OpenMP
- 20.2% compared to Intel OpenMP
- 33.2% compared to X-OpenMP
- 30.1% compared to oneTBB
- 23.0% compared to Taskflow
- 21.4% compared to OpenCilk
Quotes
"Relic parallel programming framework increases achievable performance speedups by 30.4%, 30.1%, 14.3%, 21.3%, and 8.6% for the BC, CC, PR, SSSP, and JSON parsing benchmarks, respectively, compared to the best results from the evaluated state-of-the-art frameworks."