Key Concepts
The paper introduces a memory optimization methodology that systematically manages the allocation and utilization of both on-chip and off-chip memory within a layerwise pipelined streaming architecture, enabling modern CNN models with large parameter counts and complex connections to be mapped efficiently onto FPGA devices.
Summary
The paper addresses the limitations of existing streaming architectures in scaling to modern CNN models with large parameters and complex connections, such as UNet, YOLO, and X3D, which require significant on-chip storage.
The key contributions are:
- Proposes the first streaming CNN accelerator that can partially offload weights and activations to off-chip memory without stalling the computation pipeline.
- Introduces a subgraph-based partitioning methodology that offers a latency-throughput design trade-off by exploiting the reconfigurability of FPGA devices.
- Proposes a Design Space Exploration (DSE) methodology that relies on a greedy and iterative optimization algorithm to automatically explore and determine the optimal memory and partitioning configuration.
- Accelerates a wide range of CNN benchmarks on a diverse spectrum of computer vision tasks, demonstrating competitive and even state-of-the-art performance, particularly on networks with complex, hierarchical skip connections.
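The greedy, iterative memory-fit step at the heart of the DSE contribution above can be sketched as follows. This is an illustrative simplification, not the paper's algorithm: buffer names, sizes, and the on-chip budget are made up, and the real DSE also models off-chip bandwidth and pipeline stalls.

```python
# Greedy sketch: repeatedly evict the largest on-chip buffer (weights or
# activations) to off-chip memory until the design fits the on-chip budget.
def fit_on_chip(buffers, budget_bytes):
    """buffers: dict name -> size in bytes. Returns (kept, evicted) dicts."""
    kept = dict(buffers)
    evicted = {}
    while kept and sum(kept.values()) > budget_bytes:
        # Greedy choice: evicting the largest buffer frees the most space
        # per eviction decision.
        name = max(kept, key=kept.get)
        evicted[name] = kept.pop(name)
    return kept, evicted

# Hypothetical per-buffer footprints for a small pipeline.
buffers = {
    "conv1_w": 4_000_000,
    "conv2_w": 9_000_000,
    "skip_act": 6_000_000,
    "conv3_w": 2_000_000,
}
kept, evicted = fit_on_chip(buffers, budget_bytes=8_000_000)
```

After the loop, everything left in `kept` fits on-chip, while `evicted` buffers would be streamed from off-chip memory along the pipeline.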
The paper first provides an overview of related work on streaming architectures and accelerating various computer vision tasks on FPGAs. It then details the proposed methodology, including activation eviction, weight fragmentation, and subgraph reconfiguration. The implementation aspects and the Design Space Exploration (DSE) algorithm are also discussed.
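The subgraph reconfiguration idea can be illustrated with a toy partitioner: consecutive layers are packed into subgraphs whose combined weight footprint fits the on-chip budget, and the FPGA is reconfigured between subgraphs, adding latency per input in exchange for a larger effective design per subgraph. Layer sizes and the budget below are hypothetical, and the paper's partitioning is driven by its DSE rather than this first-fit rule.

```python
# First-fit sketch of subgraph partitioning: start a new subgraph whenever
# the next layer would overflow the on-chip budget.
def partition_into_subgraphs(layer_sizes, budget):
    subgraphs, current, used = [], [], 0
    for i, size in enumerate(layer_sizes):
        if current and used + size > budget:
            subgraphs.append(current)  # close the full subgraph
            current, used = [], 0      # reconfigure for the next one
        current.append(i)
        used += size
    if current:
        subgraphs.append(current)
    return subgraphs

# Five layers with hypothetical weight footprints, budget of 8 units.
parts = partition_into_subgraphs([3, 5, 2, 6, 4], budget=8)
```

Each inner list is one subgraph (by layer index); fewer subgraphs mean fewer reconfigurations and lower latency, while smaller subgraphs leave more on-chip memory per layer.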
The evaluation section compares the proposed approach, called SMOF, against prior state-of-the-art works on a range of 2D and 3D CNN models targeting different computer vision tasks, such as image segmentation, object detection, 3D segmentation, and action recognition. SMOF demonstrates significant performance improvements, achieving up to 10.65x throughput gain compared to previous works.
Statistics
The paper provides the following key metrics for the evaluated CNN models:
- UNet: 130.12 GMACs, 28.96M parameters, 53 layers, 23 convolutional layers, input size (3, 368, 480)
- UNet3D: 918.64 GMACs, 5.65M parameters, 52 layers, 19 convolutional layers, input size (4, 155, 240, 240)
- YOLOv8n: 4.37 GMACs, 3.16M parameters, 115 layers, 63 convolutional layers, input size (3, 640, 640)
- X3D-M: 6.97 GMACs, 3.82M parameters, 396 layers, 115 convolutional layers, input size (3, 16, 256, 256)
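The figures above can be compared programmatically; the derived MACs-per-parameter ratio below is simple arithmetic on the listed numbers (not a metric from the paper) and hints at which models are compute-heavy relative to their weight footprint.

```python
# Model statistics copied from the list above.
models = {
    "UNet":    {"gmacs": 130.12, "params_m": 28.96},
    "UNet3D":  {"gmacs": 918.64, "params_m": 5.65},
    "YOLOv8n": {"gmacs": 4.37,   "params_m": 3.16},
    "X3D-M":   {"gmacs": 6.97,   "params_m": 3.82},
}

# MACs per parameter: higher means each stored weight is reused more often,
# so the model is relatively compute-bound rather than weight-bound.
macs_per_param = {
    name: m["gmacs"] * 1e9 / (m["params_m"] * 1e6)
    for name, m in models.items()
}
```

For instance, UNet3D reuses each parameter far more than UNet does, which matches its much higher GMAC count despite a smaller parameter count.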
Quotes
"The paper addresses the above limitation by introducing weight and activation eviction mechanisms to off-chip memory along the computational pipeline, taking into account the available compute and memory resources."
"SMOF has demonstrated the capacity to deliver competitive and, in some cases, state-of-the-art performance across a spectrum of computer vision tasks, achieving up to 10.65× throughput improvement compared to previous works."