The paper focuses on efficiently processing and analyzing general matrix multiplication (GEMM) workloads on two AI-optimized FPGA platforms, the Versal ACAP and the Stratix 10 NX. It makes the following key contributions:
For the Versal ACAP, the authors leverage the state-of-the-art MaxEVA framework and extend it to incorporate an additional memory hierarchy level utilizing the Versal FPGA's on-chip resources. They maximize performance via design space exploration (DSE) and analytical modeling, and propose a novel RAM optimization scheme to overcome limitations of Vitis High-Level Synthesis (HLS).
For the Stratix 10 NX, the authors develop a novel framework to design, map and optimize a configurable GEMM accelerator by exploiting the device's in-fabric Tensor Blocks (TBs). Their framework involves extensive DSE and analytical modeling to maximize GEMM performance.
The authors demonstrate their frameworks on various int8 GEMM workloads, reporting throughput of up to 77 TOPs on Versal (with 100% AIE utilization) and up to 68 TOPs on Stratix (with 91% TB utilization). Energy efficiency reaches up to 0.94 TOPs/W on Versal and 1.35 TOPs/W on Stratix, with 88% and 94% of on-chip memory used, respectively.
The paper provides notable insights and guidelines for GEMM optimization, programmability aspects, architectural attributes, and limitations on both AI-optimized FPGAs.
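To put the reported peak throughput figures in perspective, the short sketch below estimates a lower bound on the runtime of a single GEMM at those rates. It assumes the common convention of counting each multiply-accumulate as two operations (the summary does not state which convention the paper uses), and the 4096 x 4096 x 4096 problem size and the function names are hypothetical, chosen only for illustration. Real designs will take longer, since the peak figures do not hold for every workload shape.

```python
def gemm_ops(m: int, k: int, n: int) -> int:
    """Operation count for an (m x k) @ (k x n) GEMM.

    Assumption: one multiply-accumulate counts as two operations.
    """
    return 2 * m * k * n


def ideal_runtime_s(m: int, k: int, n: int, tops: float) -> float:
    """Lower-bound runtime in seconds at a sustained rate of `tops` tera-ops/s."""
    return gemm_ops(m, k, n) / (tops * 1e12)


if __name__ == "__main__":
    # Hypothetical problem size, not taken from the paper.
    m = k = n = 4096
    # Peak int8 throughput figures quoted in the summary above.
    for name, tops in (("Versal", 77.0), ("Stratix 10 NX", 68.0)):
        runtime_ms = ideal_runtime_s(m, k, n, tops) * 1e3
        print(f"{name}: at least {runtime_ms:.3f} ms at {tops} TOPs")
```

The same two functions can be reused with any measured throughput to check how close a given design gets to the reported peaks.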
Key Insights Distilled From
by Endri Taka, D... at arxiv.org 04-18-2024
https://arxiv.org/pdf/2404.11066.pdf

Deeper Inquiries