The paper introduces OccFeat, a self-supervised pretraining method for camera-only Bird's-Eye-View (BEV) segmentation networks. OccFeat leverages aligned Lidar and image data, as well as a self-supervised pretrained image encoder (DINOv2), to train the BEV network on two pretraining objectives:
Occupancy reconstruction: This objective forces the BEV network to capture the 3D geometry of the scene by predicting a 3D occupancy grid from the BEV features.
Occupancy-guided feature distillation: This objective guides the BEV network to encode high-level semantic information by training it to predict, at the occupied voxels of the 3D scene, the features of the self-supervised pretrained image encoder (see the sketch below).
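To make the two objectives concrete, here is a minimal PyTorch sketch of a pretraining head that could sit on top of the BEV features. Everything in it is illustrative: the names (OccFeatStylePretrainHead, pretrain_loss), the layer shapes, and the loss weighting are assumptions rather than the paper's exact architecture. The sketch assumes the occupancy target is a binary voxel grid derived from the lidar sweeps and the feature target is a DINOv2-style feature volume defined at occupied voxels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OccFeatStylePretrainHead(nn.Module):
    """Illustrative head lifting 2D BEV features to a coarse 3D voxel grid.

    Predicts, per voxel, (i) an occupancy logit and (ii) a feature vector
    meant to match a DINOv2-style target. Shapes and layer choices are
    assumptions for this sketch, not the paper's exact architecture.
    """

    def __init__(self, bev_channels=128, num_z=8, feat_dim=384):
        super().__init__()
        self.num_z = num_z
        self.feat_dim = feat_dim
        # One 1x1 conv maps BEV channels to Z * (1 occupancy + feat_dim) channels.
        self.head = nn.Conv2d(bev_channels, num_z * (1 + feat_dim), kernel_size=1)

    def forward(self, bev_feat):
        B, _, H, W = bev_feat.shape
        out = self.head(bev_feat).view(B, self.num_z, 1 + self.feat_dim, H, W)
        occ_logits = out[:, :, 0]   # (B, Z, H, W): occupancy reconstruction
        feat_pred = out[:, :, 1:]   # (B, Z, feat_dim, H, W): feature distillation
        return occ_logits, feat_pred


def pretrain_loss(occ_logits, feat_pred, occ_target, feat_target, w_feat=1.0):
    """Occupancy reconstruction + occupancy-guided feature distillation.

    occ_target:  binary lidar-derived occupancy grid, shape (B, Z, H, W).
    feat_target: target image-encoder features per voxel, shape (B, Z, D, H, W).
    """
    # Objective 1: predict which voxels are occupied (3D geometry).
    loss_occ = F.binary_cross_entropy_with_logits(occ_logits, occ_target)

    # Objective 2: match the target features, but only at occupied voxels,
    # so the semantic supervision is "guided" by the occupancy target.
    cos = F.cosine_similarity(feat_pred, feat_target, dim=2)  # (B, Z, H, W)
    occupied = occ_target.bool()
    if occupied.any():
        loss_feat = (1.0 - cos)[occupied].mean()
    else:
        loss_feat = occ_logits.new_zeros(())  # no occupied voxels in this batch

    return loss_occ + w_feat * loss_feat
```

Restricting the distillation term to occupied voxels reflects the design choice the method's name suggests: the dense semantic targets are only meaningful where the lidar confirms there is actual geometry, so empty space contributes only to the occupancy term.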
The authors demonstrate that this combination of 3D geometry prediction and semantic feature distillation leads to significant improvements in downstream BEV semantic segmentation tasks, especially in low-data regimes (e.g., using only 1% or 10% of the annotated training data). The proposed OccFeat pretraining approach is shown to be effective across different BEV network architectures, including SimpleBEV and BEVFormer.
The authors also conduct ablation studies to validate the importance of both pretraining objectives and to analyze the impact of pretraining duration and image resolution. Furthermore, they show that OccFeat pretraining improves the robustness of the final BEV models, as evaluated on the nuScenes-C benchmark.
Source: https://arxiv.org/pdf/2404.14027.pdf