Core Concept
The paper proposes efficient FPGA accelerator cores, PointLKCore and ReAgentCore, for deep learning-based point cloud registration methods that avoid costly feature matching.
Summary
The paper presents FPGA accelerator designs for two correspondence-free point cloud registration methods, PointNetLK and ReAgent, which leverage PointNet features to align point clouds.
Key highlights:
- The authors design a lightweight, pipelined PointNet feature extractor module that reduces on-chip memory consumption from O(N) to O(B), where N is the number of input points and B is the tile size (a streaming sketch of this tiled extraction follows this list).
- For PointNetLK, the authors introduce an improved Jacobian computation based on the central difference approximation, which is more accurate than the standard backward-difference approach, especially under quantization (a toy numerical comparison follows this list).
- The proposed accelerator cores, PointLKCore and ReAgentCore, are implemented on the Xilinx ZCU104 FPGA board. They leverage the simplified PointNet architecture and lookup-table-based quantization to store all network parameters on-chip, eliminating most off-chip memory accesses (a LUT-quantization sketch follows this list).
- Experimental results show that the proposed accelerators achieve 44.08-45.75x speedup over the ARM Cortex-A53 CPU and 1.98-11.13x speedup over an Intel Xeon CPU and Nvidia Jetson boards, while consuming less than 1W and offering 163.11-213.58x higher energy efficiency than an Nvidia GeForce GPU.
- The accelerators demonstrate real-time performance, finding reasonable registration solutions in less than 15ms, and are more robust to noise and large initial misalignments than classical methods.
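
The tiled feature extraction in the first highlight can be illustrated with a short sketch. This is not the authors' HLS source; the dimensions D and B, the Point struct, and the toy point_mlp stand-in are assumptions, but it shows why streaming tiles of B points through the shared per-point MLP and folding them into a running element-wise max keeps the live buffer at O(B) instead of O(N).

```cpp
// Minimal sketch (not the authors' HLS source): streaming PointNet feature
// extraction with tiling. Per-point features are computed tile by tile and
// folded into a running element-wise max, so only a B-sized tile buffer and
// one D-sized accumulator are live, instead of an N x D feature map.
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdio>
#include <limits>
#include <vector>

constexpr std::size_t D = 1024;  // global feature dimension (assumed)
constexpr std::size_t B = 64;    // tile size (assumed)

struct Point { float x, y, z; };

// Toy stand-in for the shared per-point MLP (the real module is a small
// multi-layer perceptron with learned weights); it only needs to map one
// 3-D point to a D-dimensional feature vector for this sketch.
std::array<float, D> point_mlp(const Point& p) {
    std::array<float, D> f{};
    for (std::size_t d = 0; d < D; ++d) {
        const float c = (d % 3 == 0) ? p.x : (d % 3 == 1) ? p.y : p.z;
        f[d] = c * static_cast<float>(d % 7 + 1);
    }
    return f;
}

// Global feature = element-wise max over all per-point features, computed
// incrementally over tiles of at most B points (O(B) intermediate storage).
std::array<float, D> extract_global_feature(const std::vector<Point>& cloud) {
    std::array<float, D> global_feat;
    global_feat.fill(-std::numeric_limits<float>::infinity());

    std::array<std::array<float, D>, B> tile_buf;  // the O(B) buffer

    for (std::size_t base = 0; base < cloud.size(); base += B) {
        const std::size_t n = std::min(B, cloud.size() - base);
        // Stage 1: shared MLP on the current tile (pipelined in hardware).
        for (std::size_t i = 0; i < n; ++i)
            tile_buf[i] = point_mlp(cloud[base + i]);
        // Stage 2: fold the tile into the running max accumulator.
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t d = 0; d < D; ++d)
                global_feat[d] = std::max(global_feat[d], tile_buf[i][d]);
    }
    return global_feat;
}

int main() {
    std::vector<Point> cloud(1000);
    for (std::size_t i = 0; i < cloud.size(); ++i)
        cloud[i] = {0.001f * i, 0.002f * i, 0.003f * i};
    const auto feat = extract_global_feature(cloud);
    std::printf("feat[0] = %f, feat[%zu] = %f\n", feat[0], D - 1, feat[D - 1]);
    return 0;
}
```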
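
The benefit of the central-difference Jacobian can be seen in a toy one-dimensional comparison. In PointNetLK the differentiated quantity is the PointNet global feature with respect to six twist parameters; the sketch below replaces it with sin(x) and a simple fixed-point rounding model (the step size and bit width are assumptions), but the mechanism is the same: the central difference has O(t^2) truncation error versus O(t) for the backward difference, and its quantization noise is divided by 2t rather than t.

```cpp
// Illustrative toy (not the paper's code): central vs. backward finite
// differences for estimating a derivative, with and without quantizing the
// function values to fixed point, as a stand-in for quantized features.
#include <cmath>
#include <cstdio>

// Round a value to 'frac_bits' fractional bits (simple fixed-point model).
double quantize(double v, int frac_bits) {
    const double scale = static_cast<double>(1 << frac_bits);
    return std::round(v * scale) / scale;
}

int main() {
    const double x0 = 0.7;        // evaluation point
    const double t  = 0.1;        // finite-difference step (assumed)
    const int frac_bits = 10;     // fixed-point precision (assumed)
    const double exact = std::cos(x0);

    for (bool quantized : {false, true}) {
        auto f = [&](double x) {
            const double v = std::sin(x);
            return quantized ? quantize(v, frac_bits) : v;
        };
        // Backward difference: O(t) truncation error.
        const double backward = (f(x0) - f(x0 - t)) / t;
        // Central difference: O(t^2) truncation error, and the quantization
        // noise is divided by 2t instead of t.
        const double central = (f(x0 + t) - f(x0 - t)) / (2.0 * t);
        std::printf("%s: backward error = %.2e, central error = %.2e\n",
                    quantized ? "quantized f" : "exact f    ",
                    std::fabs(backward - exact), std::fabs(central - exact));
    }
    return 0;
}
```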
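
Lookup-table quantization, mentioned in the third highlight, can be sketched as follows. The 4-bit index width, 16-entry codebook, and toy layer sizes are assumptions for illustration; the point is that each weight is stored as a small index into a per-layer codebook, shrinking the on-chip parameter footprint while full-precision values are recovered by a table lookup at compute time.

```cpp
// Minimal sketch (not the authors' exact scheme): lookup-table (LUT)
// quantization of network weights. Each weight is stored as a 4-bit index
// into a small per-layer codebook of representative float values, so the
// on-chip parameter footprint shrinks to roughly 1/8 of float32.
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr int kBits = 4;              // index width (assumed)
constexpr int kEntries = 1 << kBits;  // 16 codebook entries per layer

struct LutQuantizedLayer {
    std::array<float, kEntries> codebook;  // representative weight values
    std::vector<std::uint8_t> indices;     // one index per weight (stored one
                                           // per byte here for simplicity;
                                           // hardware would pack 4-bit fields)
    // Dequantize one weight by table lookup.
    float weight(std::size_t i) const { return codebook[indices[i]]; }
};

// y = W x for a row-major (out_dim x in_dim) LUT-quantized weight matrix.
std::vector<float> linear(const LutQuantizedLayer& layer,
                          const std::vector<float>& x, std::size_t out_dim) {
    const std::size_t in_dim = x.size();
    std::vector<float> y(out_dim, 0.0f);
    for (std::size_t o = 0; o < out_dim; ++o)
        for (std::size_t i = 0; i < in_dim; ++i)
            y[o] += layer.weight(o * in_dim + i) * x[i];
    return y;
}

int main() {
    // Toy 2x3 layer; codebook values are made up for illustration.
    LutQuantizedLayer layer;
    layer.codebook.fill(0.0f);
    layer.codebook[0] = -0.5f;
    layer.codebook[1] = -0.25f;
    layer.codebook[3] = 0.25f;
    layer.codebook[4] = 0.5f;
    layer.indices = {0, 3, 4, 2, 1, 3};  // six 4-bit weight indices
    const std::vector<float> x = {1.0f, 2.0f, 3.0f};
    const auto y = linear(layer, x, 2);
    std::printf("y = [%f, %f]\n", y[0], y[1]);
    return 0;
}
```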
Statistics
The paper reports the following key performance metrics:
- Speedup over ARM Cortex-A53 CPU: 44.08-45.75x
- Speedup over Intel Xeon CPU and Nvidia Jetson boards: 1.98-11.13x
- Energy efficiency relative to Nvidia GeForce GPU: 163.11-213.58x
- Registration time: less than 15 ms
Quotes
"To the best of our knowledge, we are the first to introduce FPGA accelerators for the deep learning-based point cloud registration."
"We develop accurate performance models for the proposed accelerators. Based on these, we conduct the design-space exploration to fully harness the available resources on a specified FPGA board and minimize the latency."
"For resource-efficiency, we apply the low-overhead lookup-table quantization [33] to the network parameters. While it is previously applied to the famous semantic tasks (e.g., classification and segmentation), we show its effectiveness in the geometric tasks for the first time."