The paper introduces PVTransformer, a new transformer-based architecture for 3D object detection in point clouds. The key innovation is the replacement of the pooling-based PointNet design with an attention-based point-to-voxel encoding module.
The authors identify that the common PointNet design, which uses a max pooling layer to aggregate point features into voxel features, introduces an information bottleneck that limits the accuracy and scalability of 3D object detectors. To address this limitation, the PVTransformer architecture treats each point within a voxel as a token and uses a transformer-based attention module to learn an expressive point-to-voxel aggregation function.
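The contrast between the two aggregation schemes can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: a PointNet-style elementwise max pool versus a single-head attention pool in which one learned query vector attends over the point tokens of a voxel. All names, shapes, and the single-query design are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def maxpool_voxel(points):
    """PointNet-style aggregation: elementwise max over the voxel's points.

    points: (n_points, d) array of per-point features.
    Returns a (d,) voxel feature; each channel keeps only one point's
    value, which is the information bottleneck the paper targets.
    """
    return points.max(axis=0)

def attention_voxel(points, query, w_key, w_value):
    """Attention-style aggregation (illustrative sketch).

    A learned query attends over every point token in the voxel and
    returns a weighted combination of their values, so all points can
    contribute to every channel of the voxel feature.

    points:  (n_points, d) per-point features (the tokens)
    query:   (d,) learned query vector
    w_key:   (d, d) key projection
    w_value: (d, d) value projection
    """
    d = query.shape[-1]
    keys = points @ w_key                  # (n_points, d)
    values = points @ w_value              # (n_points, d)
    scores = (keys @ query) / np.sqrt(d)   # (n_points,)
    weights = softmax(scores)              # attention over point tokens
    return weights @ values                # (d,) voxel feature
```

In a real detector the projections and query would be trained end-to-end and the attention would typically be multi-headed; the sketch only shows why the attention pool is a strictly more expressive aggregation function than the per-channel max.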
The authors conduct extensive experiments on the Waymo Open Dataset, demonstrating that PVTransformer significantly outperforms previous state-of-the-art 3D object detectors, including PointNet-based and transformer-based approaches. PVTransformer achieves a new state-of-the-art 76.5 mAPH L2 on the Waymo Open Dataset test set, outperforming the prior art by 1.7 mAPH L2.
The paper also presents a systematic study on the scalability of transformer-based 3D detectors, showing that PVTransformer exhibits better scaling properties compared to scaling PointNet and voxel architectures. The authors identify that the proposed point-to-voxel transformer is a key factor enabling effective scaling of the overall 3D detection model.
Key insights from the paper by Zhaoqi Leng,... at arxiv.org, 05-07-2024
https://arxiv.org/pdf/2405.02811.pdf