Core Concepts
A cycle-accurate performance model was developed to guide the implementation of a superscalar version of the open-source CVA6 RISC-V processor, resulting in a 40% performance improvement on the CoreMark benchmark.
Summary
The authors developed a cycle-accurate performance model of the CVA6 RISC-V processor in Python to enable efficient architectural exploration and implementation of performance-enhancing features. The model achieved 99.2% accuracy on the CoreMark benchmark compared to the RTL implementation.
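The summary does not say how the accuracy figure is defined. A minimal sketch, assuming accuracy means one minus the relative cycle-count error of the model against the RTL simulation; the function name and the cycle counts below are hypothetical and are not taken from the paper:

```python
def cycle_accuracy(model_cycles: int, rtl_cycles: int) -> float:
    """Return 1 - |model - rtl| / rtl (an assumed definition of accuracy)."""
    if rtl_cycles <= 0:
        raise ValueError("rtl_cycles must be positive")
    return 1.0 - abs(model_cycles - rtl_cycles) / rtl_cycles


# Hypothetical cycle counts, chosen only to illustrate the formula.
print(f"{cycle_accuracy(980, 1000):.1%}")
```

Under this definition the model is penalised equally for overestimating and underestimating the RTL cycle count.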
Using the performance model, the authors designed and implemented a superscalar version of CVA6 with the following key steps:
64-bit instruction fetch: Increased the instruction fetch width from 32 bits to 64 bits, resulting in a 1% performance improvement.
Dual issue (single ALU): Enabled dual-issue capability with a single ALU, leading to a 21% performance gain.
Superscalar (two ALUs): Added a second ALU to create a fully superscalar design, further improving performance by 47%.
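The difference between the last two steps can be illustrated with a toy in-order issue model. This is a sketch under strong simplifying assumptions, not the authors' model: it ignores fetch width, branches, WAW hazards and write-back ports, and the `Instr` class, the `count_cycles` function and the six-instruction program are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    unit: str                     # functional unit used: "alu" or "lsu"
    dest: Optional[str]           # destination register, None for stores
    srcs: Tuple[str, ...] = ()    # source registers
    latency: int = 1              # cycles until the result can be read


def count_cycles(program: List[Instr], issue_width: int,
                 units: Dict[str, int]) -> int:
    """Count the cycles needed to issue every instruction in program order."""
    ready_at: Dict[str, int] = {}  # register -> first cycle it can be read
    cycle = 0
    pc = 0
    while pc < len(program):
        free = dict(units)         # functional units available this cycle
        issued = 0
        while pc < len(program) and issued < issue_width:
            ins = program[pc]
            operands_ready = all(ready_at.get(r, 0) <= cycle for r in ins.srcs)
            if not operands_ready or free.get(ins.unit, 0) == 0:
                break              # in-order: a stalled instruction blocks younger ones
            free[ins.unit] -= 1
            if ins.dest is not None:
                ready_at[ins.dest] = cycle + ins.latency
            pc += 1
            issued += 1
        cycle += 1
    return cycle


program = [
    Instr("alu", "a0", ("a1",)),
    Instr("lsu", "a2", ("a3",), latency=2),   # load
    Instr("alu", "a4", ("a5",)),
    Instr("alu", "a6", ("a7",)),
    Instr("alu", "t0", ("a0", "a4")),
    Instr("lsu", None, ("a2",)),              # store
]

single_issue = count_cycles(program, 1, {"alu": 1, "lsu": 1})
dual_one_alu = count_cycles(program, 2, {"alu": 1, "lsu": 1})
dual_two_alus = count_cycles(program, 2, {"alu": 2, "lsu": 1})
print(single_issue, dual_one_alu, dual_two_alus)  # prints "6 4 3"
```

In this toy program, dual issue with one ALU only helps when an ALU instruction is paired with a load or store; the second ALU removes the structural stall between back-to-back ALU instructions.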
The model was instrumental in identifying and fixing performance bugs during the implementation phase. For example, the authors discovered an issue with the scoreboard management that was degrading performance on embedded systems. They addressed this by enhancing the scoreboard logic to better handle the limited resources.
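A separate scoreboard subtlety the authors mention is that discarding instructions by simply removing their scoreboard entries does not work, because an instruction still in the Execute stage can write its result into an entry that has since been given to a correct instruction. The sketch below illustrates that failure and one possible remedy (marking the entry as cancelled until the stale write-back drains). The `Scoreboard` class and the remedy are assumptions made for illustration; the summary does not describe how CVA6 actually handles this.

```python
class Scoreboard:
    """Toy scoreboard: a fixed number of slots, where None marks a free slot."""

    def __init__(self, size: int) -> None:
        self.entries = [None] * size

    def allocate(self, name: str) -> int:
        idx = self.entries.index(None)   # raises ValueError if the scoreboard is full
        self.entries[idx] = {"name": name, "result": None, "cancelled": False}
        return idx

    def remove(self, idx: int) -> None:
        # Naive discard: the slot becomes reusable immediately.
        self.entries[idx] = None

    def cancel(self, idx: int) -> None:
        # Assumed remedy: keep the slot occupied until the write-back arrives.
        self.entries[idx]["cancelled"] = True

    def write_back(self, idx: int, value: int) -> None:
        entry = self.entries[idx]
        if entry is None:
            return
        if entry["cancelled"]:
            self.entries[idx] = None     # stale result dropped, slot freed
            return
        entry["result"] = value


# Naive discard: the stale write-back lands in the reused slot.
naive = Scoreboard(2)
wrong = naive.allocate("wrong_path_mul")
naive.remove(wrong)
good = naive.allocate("correct_add")     # reuses the same slot
naive.write_back(wrong, 111)             # result of the discarded instruction
corrupted = naive.entries[good]["result"] == 111

# Cancel-based discard: the slot is not reused while a write-back is pending.
safe = Scoreboard(2)
wrong = safe.allocate("wrong_path_mul")
safe.cancel(wrong)
good = safe.allocate("correct_add")      # gets a different slot
safe.write_back(wrong, 111)
intact = safe.entries[good]["result"] is None
print(corrupted, intact)  # prints "True True"
```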
The final superscalar CVA6 implementation achieved a 40% performance improvement on the CoreMark benchmark compared to the single-issue reference design, with an 11% increase in area. The authors also observed a 24% performance gain on the Dhrystone benchmark, validating the effectiveness of their model-driven approach.
The authors plan to further enhance the performance model by incorporating support for divisions, data caching, and instruction caching. They also intend to explore the impact of register renaming on the superscalar CVA6 design, as it could significantly improve performance on benchmarks with more Write-After-Write (WAW) hazards.
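To see why register renaming removes WAW hazards, consider the following minimal sketch. It assumes an unlimited supply of physical registers (real hardware uses a finite free list), and the `rename` and `count_waw` helpers and the four-instruction program are hypothetical, not part of CVA6.

```python
from itertools import count
from typing import List, Tuple

Instr = Tuple[str, Tuple[str, ...]]  # (destination register, source registers)


def count_waw(program: List[Instr]) -> int:
    """Count instructions whose destination was already written earlier."""
    seen = set()
    hazards = 0
    for dest, _ in program:
        if dest in seen:
            hazards += 1
        seen.add(dest)
    return hazards


def rename(program: List[Instr]) -> List[Instr]:
    """Give every destination a fresh physical register name (p0, p1, ...)."""
    mapping = {}          # architectural register -> latest physical register
    fresh = count()
    renamed = []
    for dest, srcs in program:
        # Sources read the most recent mapping, which preserves RAW dependencies.
        new_srcs = tuple(mapping.get(s, s) for s in srcs)
        phys = f"p{next(fresh)}"
        mapping[dest] = phys
        renamed.append((phys, new_srcs))
    return renamed


program = [
    ("t0", ("a0", "a1")),
    ("t1", ("t0",)),
    ("t0", ("a2", "a3")),   # WAW with the first instruction
    ("t2", ("t0",)),
]
renamed = rename(program)
print(count_waw(program), count_waw(renamed))  # prints "1 0"
```

After renaming, the two writes to `t0` target different physical registers, so the second write no longer has to wait for the first, while each reader still sees the value it depended on.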
Statistics
The superscalar CVA6 achieved a 40.1% performance improvement on the CoreMark benchmark compared to the single-issue reference design.
The maximum frequency of the superscalar CVA6 decreased by 1.75% compared to the reference design.
The power consumption of the superscalar CVA6 increased by 7.37% compared to the reference design.
The area of the superscalar CVA6 increased by 11.1% compared to the reference design.
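These figures can be combined into rough efficiency ratios. The arithmetic below is a sketch that assumes the CoreMark gain is a per-cycle figure and that the power figure was measured under comparable conditions; the summary confirms neither, and the variable names are illustrative.

```python
COREMARK_GAIN = 0.401    # performance change vs. the reference design
FMAX_CHANGE = -0.0175    # maximum frequency change
POWER_CHANGE = 0.0737    # power consumption change
AREA_CHANGE = 0.111      # area change


def ratio(change: float) -> float:
    """Convert a relative change such as +0.401 into a multiplicative ratio."""
    return 1.0 + change


# Speedup at maximum frequency, if the CoreMark gain is per cycle (assumption).
net_speedup = ratio(COREMARK_GAIN) * ratio(FMAX_CHANGE)
perf_per_area = net_speedup / ratio(AREA_CHANGE)
perf_per_power = net_speedup / ratio(POWER_CHANGE)

print(f"net speedup:           {net_speedup:.3f}x")
print(f"performance per area:  {perf_per_area:.3f}x")
print(f"performance per power: {perf_per_power:.3f}x")
```

Under these assumptions the frequency loss trims the gain from about 1.40x to about 1.38x, and the design remains ahead of the reference in performance per unit of area and per unit of power.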
Quotes
"The superscalar feature resulted in a CVA6 performance improvement of 40% on CoreMark."
"Our model provides 3 levels of comparison with RTL results from Verilator for debugging."
"To discard instructions, simply removing them from Scoreboard does not work because the removed instruction is not cancelled in Execute stage: it can write back the wrong result into the entry of the correct instruction."