The paper introduces InverseMatrixVT3D, an efficient method for transforming multi-view image features into 3D feature volumes for 3D semantic occupancy prediction. Existing methods often rely on depth estimation, device-specific operators, or transformer queries, which hinders the widespread adoption of 3D occupancy models.
In contrast, the proposed approach uses two projection matrices to store the static mapping relationships and performs matrix multiplications to efficiently generate global BEV features and local 3D feature volumes. A sparse matrix handling technique is introduced to reduce the GPU memory used by the projection matrices. Additionally, a global-local attention fusion module integrates the global BEV features with the local 3D feature volumes to obtain the final 3D volume. A multi-scale supervision mechanism is also employed to further enhance performance.
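To make the projection-matrix idea concrete, here is a minimal sketch in plain Python rather than the paper's implementation. It assumes a hypothetical fixed assignment of flattened image pixels to voxels (`voxel_to_pixels`) and an averaging weight per voxel; both are illustrative choices, as are the function names `build_projection` and `sparse_matmul`. The matrix is stored in a CSR-style layout so that only nonzero entries take memory, which is the general motivation for sparse handling.

```python
from typing import Dict, List, Tuple

# CSR-style sparse matrix: (row pointers, column indices, values).
SparseCSR = Tuple[List[int], List[int], List[float]]


def build_projection(voxel_to_pixels: Dict[int, List[int]], num_voxels: int) -> SparseCSR:
    """Build a static voxel-by-pixel projection matrix in CSR form.

    Assumption for illustration: each voxel averages the pixels assigned to it.
    """
    indptr, indices, data = [0], [], []
    for voxel in range(num_voxels):
        pixels = voxel_to_pixels.get(voxel, [])
        for pixel in pixels:
            indices.append(pixel)
            data.append(1.0 / len(pixels))
        indptr.append(len(indices))
    return indptr, indices, data


def sparse_matmul(proj: SparseCSR, features: List[List[float]]) -> List[List[float]]:
    """Multiply the sparse projection matrix by a dense (pixels x channels) feature map."""
    indptr, indices, data = proj
    channels = len(features[0])
    out = []
    for row in range(len(indptr) - 1):
        acc = [0.0] * channels
        # Only the stored nonzero entries of this row are visited.
        for k in range(indptr[row], indptr[row + 1]):
            pixel_feat = features[indices[k]]
            for c in range(channels):
                acc[c] += data[k] * pixel_feat[c]
        out.append(acc)
    return out


# Toy data: 4 flattened pixels with 2 channels each, projected onto 3 voxels.
pixel_features = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
voxel_to_pixels = {0: [0, 1], 1: [2], 2: []}  # hypothetical static mapping
projection = build_projection(voxel_to_pixels, num_voxels=3)
voxel_features = sparse_matmul(projection, pixel_features)
```

Because the mapping depends only on a fixed geometry, the matrix can be built once and reused for every frame; at inference time, the view transformation then reduces to a single sparse-dense multiplication.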
Extensive experiments on the nuScenes and SemanticKITTI datasets show that the proposed approach is simple and effective, and that it achieves the top performance in detecting vulnerable road users (VRUs), which is crucial for autonomous driving and road safety.
Source: arxiv.org