The paper presents T10, a deep learning compiler designed to optimize the execution of deep neural network (DNN) models on inter-core connected intelligence processors, such as the Graphcore IPU.
Key highlights:
T10 introduces a new tensor abstraction called RotatingTensor (rTensor) to represent the partitioning and communication patterns of tensor operators on the distributed on-chip memory. rTensor enables T10 to map DNN models to efficient compute-shift execution plans, in which each core alternates between computing on its local sub-tensors and shifting sub-tensors to a neighboring core, avoiding any shared global memory.
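To make the compute-shift pattern concrete, below is a minimal NumPy sketch that simulates a ring of cores computing C = X @ W; the variable names and the ring-of-cores simplification are our assumptions, not T10's actual API.

```python
import numpy as np

# Hypothetical compute-shift sketch: weight sub-tensors stay resident on each
# simulated core; input sub-tensors rotate around the ring, so no core ever
# needs the full input tensor or a shared global memory.

P = 4                       # number of simulated cores on the ring
M, K, N = 8, 6, 12          # X: (M, K), W: (K, N); C = X @ W
rng = np.random.default_rng(0)
X, W = rng.standard_normal((M, K)), rng.standard_normal((K, N))

x_blocks = np.split(X, P, axis=0)          # rotating sub-tensors (one per core)
w_blocks = np.split(W, P, axis=1)          # stationary sub-tensors (one per core)
c_blocks = [[None] * P for _ in range(P)]  # c_blocks[row][col]: output tiles

for step in range(P):
    # Compute phase: each core multiplies its current X shard by its resident W shard.
    for p in range(P):
        row = (p + step) % P               # which row-block core p holds right now
        c_blocks[row][p] = x_blocks[p] @ w_blocks[p]
    # Shift phase: rotate X shards one hop around the ring (core p takes core p+1's shard).
    x_blocks = [x_blocks[(p + 1) % P] for p in range(P)]

C = np.block(c_blocks)                     # stitch output tiles back together
assert np.allclose(C, X @ W)               # matches a plain matmul
```

After P compute-shift steps, every core has seen every input shard exactly once, which is the property that lets the plan trade inter-core communication against per-core memory.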
T10 defines a comprehensive optimization space by configuring rTensors in different ways, and adopts a two-stage optimization strategy to handle the tradeoff between inter-core communication and memory footprint. It first finds the Pareto-optimal execution plans for each operator, and then employs a holistic inter-operator memory reconciliation policy to determine the best end-to-end plan for the entire DNN model.
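As an illustration of the two-stage idea, the sketch below summarizes each candidate plan as a (communication cost, memory footprint) pair, keeps only the Pareto-optimal plans per operator, and then searches for the cheapest combination that fits a memory budget. The numbers and the brute-force reconciliation are illustrative assumptions, not T10's actual cost model or policy.

```python
from itertools import product

def pareto_frontier(plans):
    """Keep plans not dominated in both communication cost and memory footprint."""
    return [p for p in plans
            if not any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in plans)]

def reconcile(per_op_frontiers, mem_budget):
    """Pick one Pareto-optimal plan per operator, minimizing total communication
    while the combined memory footprint fits the on-chip budget."""
    best, best_cost = None, float("inf")
    for combo in product(*per_op_frontiers):   # exhaustive search, for illustration only
        comm = sum(p[0] for p in combo)
        if sum(p[1] for p in combo) <= mem_budget and comm < best_cost:
            best, best_cost = combo, comm
    return best, best_cost

# Hypothetical (comm, mem) plans; (11, 1300) and (9, 1600) are dominated and dropped.
op1 = pareto_frontier([(10, 900), (6, 1200), (9, 1000), (4, 2000), (11, 1300)])
op2 = pareto_frontier([(8, 800), (5, 1500), (7, 1100), (9, 1600)])
plan, cost = reconcile([op1, op2], mem_budget=2000)
print(plan, cost)  # -> ((6, 1200), (8, 800)) 14
```

Note that the communication-cheapest plan for each operator alone ((4, 2000) and (5, 1500)) would exceed the budget together; reconciliation deliberately picks slightly costlier plans that fit, which is the inter-operator tradeoff the policy resolves.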
T10 abstracts three key device interfaces (allocate, compute, and shift) to map the optimized execution plan to the target inter-core connected AI accelerator. It also develops a sub-tensor placement algorithm to ensure data dependencies are satisfied during the rotating shifts.
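The paper identifies the three interfaces by name; the Python signatures below are our guess at how a backend might expose them, not Graphcore's or T10's actual API.

```python
from typing import Protocol, Sequence

class InterCoreDevice(Protocol):
    """Hypothetical backend contract for an inter-core connected accelerator."""

    def allocate(self, core_id: int, shape: Sequence[int]) -> int:
        """Reserve a buffer for a sub-tensor in core-local memory; returns a handle."""
        ...

    def compute(self, core_id: int, kernel: str, inputs: Sequence[int], output: int) -> None:
        """Run a local kernel on one core over previously allocated sub-tensors."""
        ...

    def shift(self, src_core: int, dst_core: int, buffer: int) -> None:
        """Move a sub-tensor between cores over the inter-core links."""
        ...
```

An optimized execution plan then lowers to a sequence of these calls, with the sub-tensor placement algorithm deciding which core each sub-tensor starts on so that every shift delivers exactly the operand its receiving core needs next.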
Evaluation on a real Graphcore IPU MK2 chip shows that T10 achieves up to 3.3x speedup over state-of-the-art DL compilers and vendor libraries, while also supporting much larger models.
Key insights distilled from the paper by Yiqi Liu, Yu..., arxiv.org, 09-25-2024: https://arxiv.org/pdf/2408.04808.pdf