Core Concepts
The author argues that efficient host-driver code generation is crucial for maximizing the potential of custom hardware accelerators, proposing specific data-related optimizations to enhance accelerator utilization and reduce latency.
Abstract
Automated generation of efficient host-driver code is essential as custom hardware accelerators become more prevalent. The study focuses on optimizing data transfers to improve accelerator utilization and reduce latency. Proposed optimizations include DMA-based data allocation, data coalescing, and software pipelining.
The research attributes accelerator under-utilization to inefficient copying of data between the heap and memory-mapped DMA buffers. The study extends AXI4MLIR with optimizations that target this bottleneck: they streamline data transfers, increase compute core utilization, and reduce latency when executing linear algebra operations.
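To make the heap-to-DMA copy concrete, the following is a minimal sketch and not the code AXI4MLIR generates. It assumes a `bytearray` can stand in for a memory-mapped DMA region; the names `DMA_BUFFER`, `staged_send`, and `alloc_in_dma` are hypothetical. It contrasts a tile that lives on the heap and must be copied into the buffer with a tile allocated directly inside the buffer.

```python
from array import array

# Hypothetical stand-in for a memory-mapped DMA region. A real driver would
# obtain this region from the platform, which is not modeled here.
DMA_BUFFER = bytearray(1024)


def staged_send(tile):
    """Baseline: the tile lives on the heap and is copied into the DMA buffer.

    Returns the number of bytes the CPU had to copy.
    """
    raw = tile.tobytes()             # serialise the heap-resident tile
    DMA_BUFFER[:len(raw)] = raw      # staging copy into the DMA region
    return len(raw)


def alloc_in_dma(n_floats, offset=0):
    """DMA-based allocation: return a float view placed directly in the buffer."""
    nbytes = 4 * n_floats            # assumes 4-byte floats
    return memoryview(DMA_BUFFER)[offset:offset + nbytes].cast("f")


heap_tile = array("f", [1.0, 2.0, 3.0, 4.0])
copied = staged_send(heap_tile)      # the CPU copies every byte of the tile

dma_tile = alloc_in_dma(4)           # no staging copy is needed afterwards
dma_tile[0] = 1.5                    # the write lands in DMA_BUFFER itself
```

In this model, every write to `dma_tile` is already visible in the buffer the accelerator would read, so the per-tile staging copy disappears.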
Key findings reveal that current implementations achieve less than 10% utilization of accelerator compute cores. The proposed optimizations address this under-utilization through DMA-based data allocation, coalescing of DMA transfers, and pipelining of the load, compute, and store stages. Together, these enhancements aim to improve the efficiency of custom accelerators on linear algebra problems.
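The effect of coalescing and pipelining can be illustrated with an idealized cycle-count model. This is a sketch under stated assumptions, not a measurement from the study: each DMA transaction is assumed to cost a fixed setup overhead plus a per-byte cost, every tile is assumed to take the same time in each stage, and all function names and numbers are hypothetical.

```python
def transfer_cycles(num_bytes, setup=100, per_byte=1):
    """Cycles for one DMA transaction: fixed setup plus a per-byte cost."""
    return setup + per_byte * num_bytes


def separate_transfers(sizes, setup=100, per_byte=1):
    """Issue one DMA transaction per buffer, paying the setup cost each time."""
    return sum(transfer_cycles(s, setup, per_byte) for s in sizes)


def coalesced_transfer(sizes, setup=100, per_byte=1):
    """Coalescing: pack all buffers into one transaction, paying setup once."""
    return transfer_cycles(sum(sizes), setup, per_byte)


def sequential_cycles(n_tiles, load, compute, store):
    """No overlap: each tile is loaded, computed, and stored before the next."""
    return n_tiles * (load + compute + store)


def pipelined_cycles(n_tiles, load, compute, store):
    """Idealized three-stage pipeline with uniform per-tile stage times.

    The first tile fills the pipeline; each later tile adds only the
    slowest stage's time.
    """
    if n_tiles == 0:
        return 0
    return (load + compute + store) + (n_tiles - 1) * max(load, compute, store)


sizes = [64, 64, 64]
print(separate_transfers(sizes), coalesced_transfer(sizes))
print(sequential_cycles(4, 10, 20, 10), pipelined_cycles(4, 10, 20, 10))
```

With these illustrative numbers, coalescing saves the setup cost of two transactions, and pipelining hides the load and store stages behind the compute stage for all but the first tile.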
Stats
First, the accelerator’s compute core utilization is less than 10%.
Second, the critical latency bottleneck is caused by copying data between the heap and memory-mapped DMA buffers.
Figure 1 shows a breakdown of clock cycles spent inside a simple MatMul accelerator.
Quotes
"Efficient communication between the accelerator, off-chip memory, and host CPU demands effective hardware-software co-design."
"Manually developing host driver code for different applications is time-consuming and error-prone."
"AXI4MLIR provides a solution for generating efficient host-driver code."