Core Concepts
Efficiently deploying visual Transformers on resource-constrained processing-in-memory (PIM) systems to minimize end-to-end inference latency.
Abstract
The paper introduces Allspark, a workload-orchestration framework for visual Transformers on processing-in-memory (PIM) systems. To address the challenges of deploying Transformer models efficiently, it combines finer-grained partitioning, a systematic layout, and interleaved dataflows. A scheduling method based on tensor parallelization minimizes inference latency by optimizing how computational branches are allocated across temporal layers, while memory-constraint-driven weight sharing and reuse strategies alleviate the memory burden and further improve performance.
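To make the tensor-parallel scheduling idea concrete, here is a minimal sketch (not the paper's actual algorithm; the branch costs and unit count are invented): partition a layer's computational branches, such as attention heads, across PIM compute units so that the slowest unit, which bounds the layer's latency, is as fast as possible.

```python
# Hypothetical sketch of tensor-parallel scheduling on a PIM system.
# Branches (e.g. attention heads) are assigned to compute units with a
# greedy longest-processing-time heuristic; latency = the busiest unit.
# All numbers and names below are illustrative, not from the paper.

def schedule_branches(branch_costs, num_units):
    """Assign each branch to the currently least-loaded unit,
    processing branches in decreasing order of cost."""
    loads = [0.0] * num_units                    # accumulated cost per unit
    assignment = [[] for _ in range(num_units)]  # branch ids per unit
    for branch, cost in sorted(enumerate(branch_costs),
                               key=lambda x: -x[1]):
        u = min(range(num_units), key=lambda i: loads[i])
        loads[u] += cost
        assignment[u].append(branch)
    return assignment, max(loads)                # latency = slowest unit

# e.g. 12 attention heads with uneven costs, spread over 4 PIM units
costs = [3, 3, 3, 2, 2, 2, 2, 1, 1, 1, 1, 1]
plan, latency = schedule_branches(costs, 4)
```

With these made-up costs the heuristic balances all four units to a load of 6, against a total work of 22; a serial execution on one unit would take 22.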
Structure:
- Introduction to Transformers and their challenges.
- Overview of Processing In-Memory (PIM) systems.
- Challenges with end-to-end inference deployment.
- Framework overview of Allspark.
- Partitioning and Dataflow Formation details.
- Scheduling for End-to-End Inference methodology.
- Memory Constraint-driven Weight Sharing and Reuse strategies.
Key Highlights:
- Introduction of Allspark framework for workload orchestration in visual Transformers on PIM systems.
- Utilization of finer-grained partitioning, systematic layout, and interleaved dataflows for efficient deployment.
- Scheduling method based on tensor parallelization to minimize inference latency.
- Memory constraint-driven weight sharing and reuse strategies to alleviate memory burden.
Stats
Extensive experiments show that Allspark delivers a 1.2×–24.0× inference speedup over baseline deployments.
Compared with an Nvidia V100 GPU, the Allspark-enriched PIM system yields average speedups of 2.3× and energy savings of 20×–55×.