Core Concepts
Allo provides a composable programming model that decouples hardware customizations from algorithm specifications, enabling progressive and verifiable transformations to construct high-performance spatial accelerator designs.
Abstract
The paper introduces Allo, a new programming model for composable design of high-performance spatial accelerator architectures. Allo decouples hardware customizations, including compute, memory, communication, and data type, from algorithm specification, and encapsulates them as a set of customization primitives. Allo preserves the hierarchical structure of an input program by combining customizations from different functions in a bottom-up, type-safe manner, facilitating holistic optimizations that span across function boundaries.
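To make the decoupling concrete, here is a minimal sketch in the style of the open-source Allo front end: the algorithm is a plain Python kernel, and the hardware customizations live in a separate schedule. The primitive names, loop labels, and signatures shown are assumptions based on the public Allo API and may differ from the exact listings in the paper.

```python
import allo
from allo.ir.types import float32

M, N, K = 1024, 1024, 1024

# Algorithm specification: a plain GEMM kernel with no hardware concerns.
def gemm(A: float32[M, K], B: float32[K, N]) -> float32[M, N]:
    C: float32[M, N] = 0.0
    for i, j in allo.grid(M, N):
        for k in allo.reduction(K):
            C[i, j] += A[i, k] * B[k, j]
    return C

# Hardware customizations are expressed separately as schedule primitives,
# applied step by step without editing the kernel source above.
s = allo.customize(gemm)
s.split("j", 8)               # compute customization: tile loop j
s.pipeline("j.inner")         # compute customization: pipeline the inner tile loop
s.partition(s.A, dim=2)       # memory customization: partition buffer A for parallel access
mod = s.build(target="vhls")  # emit HLS C++ for an FPGA backend
```

Because each primitive is a small, separate transformation, the design can be refined incrementally and each intermediate version remains a valid, testable program.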
The key features of Allo include:
- Progressive hardware customizations: Allo provides decoupled hardware customization primitives, allowing users to progressively transform a vanilla program into a high-performance design, with each step being verifiable.
- Composable schedules: Allo enables users to construct modular hardware accelerators from the ground up by combining customized kernels and external IPs. A type system for the memory layout is also proposed to ensure type safety during schedule composition (see the sketch after this list).
- Holistic dataflow optimizations: Allo introduces a hierarchical dataflow graph to support the composition of multiple kernels within a complex design while maintaining the function boundaries. It ensures the correctness of the interfaces when integrating distinct kernels and effectively sizes the streaming buffers (FIFOs) between stages.
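A hedged sketch of how two independently customized kernels might be composed into a single dataflow design. The top-level function, the `compose` calls, and especially the `.to()` streaming line are assumptions modeled on the open-source Allo front end, not the paper's exact listings.

```python
import allo
from allo.ir.types import float32

M, N, K = 64, 64, 64

# Two kernels written and customized independently.
def gemm1(A: float32[M, K], B: float32[K, N]) -> float32[M, N]:
    C: float32[M, N] = 0.0
    for i, j in allo.grid(M, N):
        for k in allo.reduction(K):
            C[i, j] += A[i, k] * B[k, j]
    return C

def gemm2(C: float32[M, N], D: float32[N, N]) -> float32[M, N]:
    E: float32[M, N] = 0.0
    for i, j in allo.grid(M, N):
        for k in allo.reduction(N):
            E[i, j] += C[i, k] * D[k, j]
    return E

# Top-level design that cascades the two kernels.
def top(A: float32[M, K], B: float32[K, N], D: float32[N, N]) -> float32[M, N]:
    C = gemm1(A, B)
    return gemm2(C, D)

# Customize each kernel in isolation ...
s1 = allo.customize(gemm1)
s1.pipeline("j")
s2 = allo.customize(gemm2)
s2.pipeline("j")

# ... then compose them bottom-up into the top-level schedule. The memory
# layout type system checks that the composed interfaces agree.
s = allo.customize(top)
s.compose(s1)
s.compose(s2)
s.to(s.C, "gemm2")            # assumed form of the streaming primitive: forward
                              # the intermediate tensor to gemm2 through a FIFO
mod = s.build(target="vhls")
```

The function boundaries of `gemm1` and `gemm2` are preserved in the composed design, which is what allows the hierarchical dataflow graph to reason about inter-kernel interfaces and FIFO sizing.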
The authors conduct comprehensive experiments on both realistic benchmarks and large neural networks. On PolyBench, Allo outperforms several state-of-the-art HLS tools and ADLs. Furthermore, Allo is applied to building an FPGA accelerator for large language models (LLMs), demonstrating a 1.7x speedup and 5.4x higher energy efficiency than an A100 GPU on the GPT2 model.
Statistics
- Latency of the vanilla GEMM implementation: 25,074 ms
- Latency of the inner-product GEMM implementation: 17,950 ms
- Latency of the row-wise product GEMM implementation: 112 ms
- Latency of the simple cascade of two GEMM kernels: 280 ms
- Latency of the interface-unified cascade of two GEMM kernels: 224 ms
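The gap between the inner-product and row-wise product numbers comes from moving the loop-carried accumulation out of the pipelined loop. A hedged sketch of the kind of schedules that express the two variants, reusing the `gemm` kernel and imports from the first sketch above; loop names and primitive signatures are assumptions:

```python
# Inner-product order: pipelining the reduction loop k keeps a carried
# dependence on C[i, j], so the achievable initiation interval stays high.
s_inner = allo.customize(gemm)
s_inner.pipeline("k")

# Row-wise product order: hoisting the reduction above j and buffering a
# row of C on chip removes the dependence from the pipelined loop, letting
# the j loop run at II = 1.
s_row = allo.customize(gemm)
s_row.reorder("k", "j")
s_row.buffer_at(s_row.C, axis="i")
s_row.pipeline("j")
```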
Quotes
"Existing HLS tools often require intrusive source-level changes to attain satisfactory quality of results."
"Existing ADLs prove less effective for realistic hierarchical designs with multiple kernels, even if the design hierarchy is flattened."
"Allo decouples hardware customizations, including compute, memory, communication, and data type from algorithm specification, and encapsulates them as a set of customization primitives."
"Allo preserves the hierarchical structure of an input program by combining customizations from different functions in a bottom-up, type-safe manner."