Core Concepts
This work investigates the efficient mapping of parallel general matrix multiplication (GEMM) to the AMD Versal Adaptive Compute Accelerated Platform (ACAP) equipped with multiple Artificial Intelligence Engines (AIEs) to accelerate deep learning inference.
Abstract
The paper presents a customized design of parallel GEMM for the AMD Versal ACAP, a heterogeneous system-on-chip that integrates ARM processors, FPGA, and an array of high-performance vector AIEs. The key contributions are:
Memory Mapping: The authors leverage the multi-level memory hierarchy of the Versal ACAP, including the FPGA Ultra/Block RAMs and the local memory of the AIE tiles, to efficiently distribute the matrix operands and exploit data reuse.
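The multi-level blocking idea can be sketched as a classically blocked GEMM, where each loop level corresponds to staging operands one level down the memory hierarchy. The tile sizes `mc`, `nc`, `kc` here are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

def blocked_gemm(A, B, C, mc=64, nc=64, kc=32):
    """Blocked GEMM computing C += A @ B.

    Each loop level mimics one level of the Versal memory hierarchy:
    the outer panels would be staged in FPGA Ultra/Block RAM, and the
    innermost mc x kc / kc x nc blocks in AIE-tile local memory, so that
    each block is reused many times once it is resident on chip.
    Tile sizes are illustrative, not the paper's values.
    """
    m, k = A.shape
    _, n = B.shape
    for jc in range(0, n, nc):           # panel of B staged on chip
        for pc in range(0, k, kc):       # k-panel reused across all row blocks
            for ic in range(0, m, mc):   # block of A in tile-local memory
                C[ic:ic+mc, jc:jc+nc] += (
                    A[ic:ic+mc, pc:pc+kc] @ B[pc:pc+kc, jc:jc+nc]
                )
    return C
```

The loop order keeps the same panel of B resident while all row blocks of A stream past it, which is the data-reuse pattern the memory mapping is designed to exploit.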
Architecture-Specific Micro-kernel: To address the demand for low-precision inference in deep learning, the authors propose a micro-kernel design that utilizes the SIMD units in the AIE tiles to perform mixed-precision arithmetic.
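A minimal scalar stand-in for such a mixed-precision micro-kernel is shown below: low-precision (int8) operand blocks are widened and accumulated in int32, which is the usual pattern for avoiding overflow in low-precision inference. The shapes and data types are assumptions for illustration, not the paper's exact kernel:

```python
import numpy as np

def mixed_precision_microkernel(a_block, b_block):
    """Multiply int8 operand blocks, accumulating in int32.

    This is a scalar stand-in for the SIMD lanes of an AIE tile: each
    lane would compute several of these products per cycle. Widening to
    int32 before the multiply prevents the int8 products from overflowing.
    """
    acc = np.zeros((a_block.shape[0], b_block.shape[1]), dtype=np.int32)
    acc += a_block.astype(np.int32) @ b_block.astype(np.int32)
    return acc
```

A real AIE micro-kernel would express the same computation with vector intrinsics over the tile's SIMD registers; the numerical behavior (narrow inputs, wide accumulator) is the part this sketch preserves.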
Parallel Design: The authors introduce a parallel GEMM design that distributes the computation across multiple AIE tiles, analyzing the theoretical performance and conducting experimental profiling to demonstrate the high parallel scalability.
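One plausible way to picture the distribution is to partition the rows of C across tiles, with each tile computing an independent block; this sketch uses threads as stand-ins for AIE tiles and is an assumption about the partitioning scheme, not necessarily the paper's exact layout:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_gemm(A, B, num_tiles=4):
    """Distribute C = A @ B over 'num_tiles' workers.

    Each worker stands in for one AIE tile and owns a disjoint block of
    rows of C, so the per-tile computations are fully independent and no
    inter-tile synchronization is needed during the multiply.
    """
    m = A.shape[0]
    row_blocks = np.array_split(np.arange(m), num_tiles)
    C = np.empty((m, B.shape[1]), dtype=A.dtype)

    def work(rows):
        C[rows] = A[rows] @ B  # independent per-tile block of C

    with ThreadPoolExecutor(max_workers=num_tiles) as ex:
        list(ex.map(work, row_blocks))
    return C
```

Because the blocks are independent, the only shared resource is the bandwidth used to feed B to every tile, which is consistent with the communication bottlenecks the performance analysis identifies.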
The paper first provides an overview of high-performance GEMM algorithms and the architecture of the Versal ACAP. It then delves into the details of the customized GEMM design, including the mapping of matrix operands to the memory hierarchy, the micro-kernel implementation, and the parallelization strategy. Finally, the authors present a comprehensive performance analysis, identifying communication bottlenecks and outlining potential optimization strategies.
Stats
The paper reports the following key performance metrics:
For a single AIE tile, the GEMM design achieves 31.5 MACs/cycle.
When scaling from 1 to 32 AIE tiles in a strong scaling scenario, the performance per tile only degrades by about 5%, demonstrating high parallel scalability.

The micro-kernel overlaps data transfers with arithmetic computation, so the total execution time reduces to the cost of the data transfers alone; the computation is fully hidden behind the communication.
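A quick back-of-envelope check ties these numbers together: if per-tile throughput drops by the quoted ~5% when scaling to 32 tiles, the aggregate throughput follows directly. The helper below is illustrative arithmetic, not a figure from the paper:

```python
def aggregate_throughput(per_tile_macs, tiles, degradation):
    """Aggregate MACs/cycle after scaling.

    Per-tile throughput drops by 'degradation' (a fraction) relative to
    the single-tile rate, so the aggregate is tiles * per_tile * (1 - d).
    """
    return tiles * per_tile_macs * (1 - degradation)

# 31.5 MACs/cycle per tile, 32 tiles, ~5% strong-scaling degradation:
total = aggregate_throughput(31.5, 32, 0.05)  # ~957.6 MACs/cycle aggregate
```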
Quotes
"The parallel design is highly scalable, with a parallel efficiency that, in a strong scaling scenario, only degrades by 5% when increasing the number of AIE tiles from 1 to 32."
"The implementation is memory-bound on this platform mostly due to the low bandwidth of the FPGA Ultra RAM."