The paper presents T10, a deep learning compiler designed to optimize the execution of deep neural network (DNN) models on inter-core connected intelligence processors, such as the Graphcore IPU.
Key highlights:
T10 introduces a new tensor abstraction called RotatingTensor (rTensor) to represent the partitioning and communication patterns of tensor operators on the distributed on-chip memory. rTensor enables T10 to map DNN models to efficient compute-shift execution plans, in which each core alternates between computing on its locally resident sub-tensors and shifting sub-tensors to neighboring cores over the inter-core links.
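
To make the compute-shift pattern concrete, here is a minimal Python sketch (the names `RotatingTensor`, `shift`, and `compute_shift_matmul` are illustrative assumptions, not T10's actual API): a distributed matrix multiplication where each core keeps its row block of X stationary while the column blocks of W rotate one core per step.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RotatingTensor:
    """Illustrative stand-in for T10's rTensor: a tensor split into
    per-core sub-tensors that rotate between cores after each compute step."""
    parts: list  # parts[i] = sub-tensor currently resident on core i

    def shift(self):
        # Emulate the inter-core shift: every core passes its sub-tensor
        # to the next core (rotation by one hop).
        self.parts = self.parts[1:] + self.parts[:1]

def compute_shift_matmul(x_parts, w):
    """Compute-shift execution of Y = X @ W: X's row blocks stay put,
    W's column blocks (wrapped in a RotatingTensor) rotate among cores."""
    n_cores = len(x_parts)
    y_parts = [[] for _ in range(n_cores)]
    col_owner = list(range(n_cores))  # which W column block core i holds now
    for _ in range(n_cores):
        for core in range(n_cores):
            # Local compute step: partial product with the resident W slice.
            y_parts[core].append((col_owner[core], x_parts[core] @ w.parts[core]))
        w.shift()                                  # inter-core shift step
        col_owner = col_owner[1:] + col_owner[:1]
    # Reassemble each core's output columns in canonical order.
    return [np.hstack([blk for _, blk in sorted(p)]) for p in y_parts]

# Demo: 4 cores, X split into 4 row blocks, W into 4 column blocks.
rng = np.random.default_rng(0)
X, W = rng.standard_normal((8, 6)), rng.standard_normal((6, 8))
x_parts = list(X.reshape(4, 2, 6))
w = RotatingTensor(parts=[W[:, 2 * i:2 * i + 2] for i in range(4)])
assert np.allclose(np.vstack(compute_shift_matmul(x_parts, w)), X @ W)
```

After `n_cores` compute-shift rounds every core has multiplied its X block against every W block without any core ever holding the full W, which is the property that lets this execution style fit large operators in distributed on-chip memory.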
T10 defines a comprehensive optimization space by varying how each rTensor is partitioned and rotated, and adopts a two-stage optimization strategy to handle the tradeoff between inter-core communication and memory footprint. It first finds the Pareto-optimal execution plans for each operator, then employs a holistic inter-operator memory reconciliation policy to determine the best end-to-end plan for the entire DNN model.
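
The two-stage strategy can be sketched as follows, under a simplified model where each candidate plan is scored by a single communication cost and a per-core memory footprint; the reconciliation stage is brute force here purely for illustration, whereas T10's actual policy is more sophisticated.

```python
from itertools import product

def pareto_front(plans):
    """plans: (cost, mem) pairs per candidate plan, e.g. cost = inter-core
    traffic, mem = per-core footprint. Returns the non-dominated plans."""
    front, best_mem = [], float("inf")
    for cost, mem in sorted(plans):
        if mem < best_mem:  # every cheaper plan needs strictly more memory
            front.append((cost, mem))
            best_mem = mem
    return front

def reconcile(per_op_fronts, mem_capacity):
    """Second stage, brute force for clarity: pick one Pareto-optimal plan
    per operator, minimizing total cost subject to the memory budget.
    (Summing footprints assumes all operators' working sets coexist; this
    is an assumption of the sketch, not T10's actual reconciliation model.)"""
    best = None
    for combo in product(*per_op_fronts):
        cost = sum(c for c, _ in combo)
        mem = sum(m for _, m in combo)
        if mem <= mem_capacity and (best is None or cost < best[0]):
            best = (cost, combo)
    return best

# Two operators with candidate (cost, mem) plans; budget of 12 memory units.
op1 = pareto_front([(10, 8), (14, 5), (12, 9), (20, 3)])
op2 = pareto_front([(6, 7), (9, 4), (7, 10)])
print(reconcile([op1, op2], mem_capacity=12))  # -> (19, ((10, 8), (9, 4)))
```

Pruning each operator to its Pareto front first keeps the combinatorial second stage small: any plan dominated on both cost and memory can never appear in an optimal end-to-end combination.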
T10 abstracts three key device interfaces (allocate, compute, and shift) to map the optimized execution plan to the target inter-core connected AI accelerator. It also develops a sub-tensor placement algorithm to ensure data dependencies are satisfied during the rotating shifts.
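
A hypothetical back-end interface built around those three primitives might look like the sketch below; only the primitive names (allocate, compute, shift) come from the summary, while the signatures and types are assumptions for illustration.

```python
from abc import ABC, abstractmethod

class InterCoreDevice(ABC):
    """Hypothetical target-device interface; signatures are illustrative
    assumptions, not T10's real API."""

    @abstractmethod
    def allocate(self, core_id: int, nbytes: int) -> int:
        """Reserve nbytes in core core_id's local memory; return a handle."""

    @abstractmethod
    def compute(self, core_id: int, kernel: str, buffers: list[int]) -> None:
        """Run a per-core kernel over buffers already resident on that core."""

    @abstractmethod
    def shift(self, handle: int, hops: int = 1) -> None:
        """Move each core's sub-tensor of `handle` to the core `hops` links
        away. The initial placement must guarantee that after every shift,
        each core finds exactly the operand its next compute step needs;
        choosing that placement is the job of the sub-tensor placement
        algorithm mentioned above."""
```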
Evaluation on a real Graphcore IPU MK2 chip shows that T10 achieves up to a 3.3x speedup over state-of-the-art DL compilers and vendor libraries, and supports much larger models than they can handle.