Core Concept
The authors present Co-Occ, a multi-modal 3D semantic occupancy prediction framework that couples explicit LiDAR-camera feature fusion with implicit volume-rendering regularization to exploit the complementary strengths of LiDAR and camera data.
Summary
The proposed Co-Occ framework consists of two key components:
- Explicit Geometric- and Semantic-Aware Fusion (GSFusion) Module:
- Extracts features from LiDAR and camera data and projects them into a unified voxel space.
- Fuses the features with a KNN-based approach that explicitly injects semantic information from the camera features into the LiDAR features, which is especially helpful where the LiDAR input is sparse (see the fusion sketch after this list).
- Implicit Volume Rendering-based Regularization:
- Casts rays from the camera into the scene and samples points uniformly along each ray.
- Retrieves the features of these samples from the fused volumetric feature and uses two auxiliary heads to predict the density and color of each sample.
- Projects the rendered color and depth maps back onto the 2D image plane and supervises them with ground truth from cameras and LiDAR, respectively.
- This bridges the gap between 3D LiDAR sweeps and 2D camera images and enhances the fused volumetric representation (a rendering sketch follows this list).
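As a rough illustration of the KNN-based fusion step, the sketch below aggregates, for each non-empty LiDAR voxel, the features of its k nearest camera-feature voxels with inverse-distance weights and concatenates them with the LiDAR features. The function name, tensor shapes, value of k, and the concatenation-based fusion are illustrative assumptions, not the exact GSFusion implementation.

```python
# Hypothetical sketch of KNN-based LiDAR-camera voxel feature fusion.
# Shapes, names, and the inverse-distance aggregation are assumptions
# for illustration; they are not the paper's exact GSFusion module.
import torch


def knn_semantic_fusion(lidar_coords, lidar_feats, cam_coords, cam_feats, k=3):
    """Fuse camera voxel features into LiDAR voxel features via KNN.

    lidar_coords: (N, 3) centers of non-empty LiDAR voxels
    lidar_feats:  (N, C) LiDAR voxel features
    cam_coords:   (M, 3) centers of voxels holding camera (semantic) features
    cam_feats:    (M, C) camera voxel features
    Returns fused features of shape (N, 2 * C).
    """
    # Pairwise distances between LiDAR voxels and camera voxels: (N, M)
    dists = torch.cdist(lidar_coords, cam_coords)

    # Distances and indices of the k nearest camera voxels per LiDAR voxel
    knn_dists, knn_idx = dists.topk(k, dim=1, largest=False)

    # Gather the corresponding camera features: (N, k, C)
    neighbors = cam_feats[knn_idx]

    # Inverse-distance weights, normalized over the k neighbors
    weights = 1.0 / (knn_dists + 1e-6)
    weights = weights / weights.sum(dim=1, keepdim=True)

    # Weighted aggregation of camera semantics for each LiDAR voxel: (N, C)
    cam_agg = (weights.unsqueeze(-1) * neighbors).sum(dim=1)

    # Simple fusion by concatenation (a learned projection could follow)
    return torch.cat([lidar_feats, cam_agg], dim=-1)


if __name__ == "__main__":
    fused = knn_semantic_fusion(
        lidar_coords=torch.rand(128, 3) * 50.0,
        lidar_feats=torch.rand(128, 32),
        cam_coords=torch.rand(256, 3) * 50.0,
        cam_feats=torch.rand(256, 32),
    )
    print(fused.shape)  # torch.Size([128, 64])
```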
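The rendering-based regularization can be pictured as standard volume rendering over the fused volume, as sketched below: sample points uniformly along each camera ray, trilinearly read fused features at those points, predict per-sample density and color with two small auxiliary heads, and alpha-composite them into a rendered color and an expected depth that can be supervised by the camera image and the projected LiDAR depth. The head definitions, grid normalization, and near/far bounds here are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative sketch of volume-rendering-based regularization over a fused
# voxel volume. Auxiliary heads, scene bounds, and the coordinate
# normalization are assumptions, not the paper's exact setup.
import torch
import torch.nn.functional as F

C = 32                                 # assumed fused feature channels
density_head = torch.nn.Linear(C, 1)   # assumed auxiliary density head
color_head = torch.nn.Linear(C, 3)     # assumed auxiliary color head


def render_rays(fused_volume, rays_o, rays_d, near=1.0, far=50.0, n_samples=64):
    """fused_volume: (1, C, D, H, W); rays_o, rays_d: (R, 3) in scene coords."""
    # Uniform depth samples along each ray: (R, S)
    t = torch.linspace(near, far, n_samples).expand(rays_o.shape[0], n_samples)
    pts = rays_o[:, None, :] + t[..., None] * rays_d[:, None, :]    # (R, S, 3)

    # Map points into normalized [-1, 1] grid coordinates (assumed cube of
    # half-side `far`) and trilinearly sample the fused features
    grid = (pts / far).view(1, -1, 1, 1, 3)                         # (1, R*S, 1, 1, 3)
    feats = F.grid_sample(fused_volume, grid, align_corners=True)   # (1, C, R*S, 1, 1)
    feats = feats.view(C, -1).t().reshape(*pts.shape[:2], C)        # (R, S, C)

    # Per-sample density and color from the auxiliary heads
    sigma = F.relu(density_head(feats)).squeeze(-1)                 # (R, S)
    rgb = torch.sigmoid(color_head(feats))                          # (R, S, 3)

    # Standard alpha compositing (volume rendering weights)
    delta = t[:, 1:] - t[:, :-1]
    delta = torch.cat([delta, delta[:, -1:]], dim=1)
    alpha = 1.0 - torch.exp(-sigma * delta)
    trans = torch.cumprod(torch.cat(
        [torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1), dim=1)[:, :-1]
    weights = alpha * trans                                         # (R, S)

    color = (weights[..., None] * rgb).sum(dim=1)   # rendered color, supervised by camera
    depth = (weights * t).sum(dim=1)                # rendered depth, supervised by LiDAR
    return color, depth
```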
The authors conduct extensive experiments on the nuScenes and SemanticKITTI benchmarks, demonstrating that their Co-Occ framework outperforms state-of-the-art methods in 3D semantic occupancy prediction.
Statistics
The nuScenes dataset provides 3D occupancy ground truth with a voxel size of [0.5 m, 0.5 m, 0.5 m] over a dense grid of [200, 200, 16] voxels.
The SemanticKITTI dataset provides 3D semantic occupancy ground truth with a voxel size of [0.2 m, 0.2 m, 0.2 m] over a grid of [256, 256, 32] voxels.
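For intuition, the stated voxel sizes and grid dimensions imply the following spatial coverage; this is simple arithmetic on the numbers above, not an additional dataset fact.

```python
# Spatial extent implied by voxel size x grid dimensions (illustrative check).
def extent(grid_dims, voxel_size):
    return [d * voxel_size for d in grid_dims]

print(extent([200, 200, 16], 0.5))   # nuScenes:      [100.0, 100.0, 8.0] meters
print(extent([256, 256, 32], 0.2))   # SemanticKITTI: [51.2, 51.2, 6.4] meters
```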
Quotes
"Leveraging the complementary strengths of LiDAR and camera data is crucial in various 3D perception tasks."
"The fusion of LiDAR-camera data for 3D semantic occupancy prediction is not a straightforward task due to the heterogeneity between the modalities and the limited interaction between them."