The paper proposes MACO, a loosely coupled multi-core general-purpose processor architecture optimized for GEMM-related applications. MACO consists of up to 16 homogeneous compute nodes interconnected by a network-on-chip (NoC), with each node integrating a CPU core and a dedicated Matrix Multiplication Acceleration Engine (MMAE).
To improve the programmability and flexibility of MACO, the authors introduce a tile-based instruction set architecture called MPAIS, which provides instructions for data migration, tile GEMM computation, and task management. MACO also employs hardware-assisted data prefetching and locking, together with predictive address translation, to further improve computational efficiency on GEMM workloads.
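The summary does not reproduce MPAIS itself, but the split into data-migration, tile-GEMM, and task-management operations can be illustrated with an ordinary tiled GEMM in C. The sketch below is illustrative only: the tile size, function names, and buffer layout are assumptions, and the plain loops merely stand in for what MPAIS instructions and the MMAE would perform in hardware.

```c
/* Illustrative sketch (not the actual MPAIS ISA): a tiled GEMM decomposed
 * into the three kinds of operations the summary attributes to MPAIS --
 * data migration (staging tiles into a node-local buffer), tile GEMM
 * computation (the kernel a matrix engine would execute), and task
 * management (the loop nest that walks the tile grid).
 * Tile size and all function names here are hypothetical. */
#include <stdio.h>
#include <string.h>

#define N    256          /* matrix dimension (assumed square for brevity) */
#define TILE 64           /* hypothetical tile edge handled per engine call */

static float A[N][N], B[N][N], C[N][N];

/* "Data migration": copy a TILE x TILE block into a contiguous local
 * buffer, standing in for prefetching a tile into node-local memory. */
static void load_tile(float dst[TILE][TILE], float src[N][N],
                      int row, int col) {
    for (int i = 0; i < TILE; i++)
        memcpy(dst[i], &src[row + i][col], TILE * sizeof(float));
}

/* "Tile GEMM computation": C_tile += A_tile * B_tile, the per-tile work
 * that a dedicated matrix engine would perform. */
static void tile_gemm(float c[TILE][TILE], float a[TILE][TILE],
                      float b[TILE][TILE]) {
    for (int i = 0; i < TILE; i++)
        for (int k = 0; k < TILE; k++)
            for (int j = 0; j < TILE; j++)
                c[i][j] += a[i][k] * b[k][j];
}

int main(void) {
    /* simple initialization so the result is easy to check */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0f; B[i][j] = 2.0f; }

    static float a_t[TILE][TILE], b_t[TILE][TILE], c_t[TILE][TILE];

    /* "Task management": walk the tile grid; on MACO such tile tasks could
     * be distributed across the compute nodes over the NoC. */
    for (int bi = 0; bi < N; bi += TILE)
        for (int bj = 0; bj < N; bj += TILE) {
            memset(c_t, 0, sizeof(c_t));
            for (int bk = 0; bk < N; bk += TILE) {
                load_tile(a_t, A, bi, bk);
                load_tile(b_t, B, bk, bj);
                tile_gemm(c_t, a_t, b_t);
            }
            for (int i = 0; i < TILE; i++)
                memcpy(&C[bi + i][bj], c_t[i], TILE * sizeof(float));
        }

    printf("C[0][0] = %.1f (expected %.1f)\n", C[0][0], 2.0f * N);
    return 0;
}
```

With all of A set to 1 and all of B set to 2, every element of C should equal 2*N = 512, which the final print checks; the same loop structure also shows where hardware prefetching (the tile loads) and per-tile acceleration (the inner kernel) would apply.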
The experimental results demonstrate that MACO scales well, achieving an average computational efficiency of 90% across multiple cores. Furthermore, MACO delivers up to 1.1 TFLOPS at 88% computational efficiency on state-of-the-art deep neural network workloads, indicating its adaptability to deep learning applications.
Source: Bingcai Sui et al., arXiv, 05-01-2024. https://arxiv.org/pdf/2404.19180.pdf