Core Concepts
Vision-based 3D occupancy prediction is a promising solution for providing fine-grained representation and robust detection of the surrounding environment for autonomous driving applications.
Abstract
This paper provides a comprehensive review of the current progress in vision-based 3D occupancy prediction for autonomous driving. It first introduces the background, task definition, ground truth generation, common datasets, and evaluation metrics for this task. The key challenges include recovering accurate 3D features from 2D visual inputs, the heavy computational load of reasoning in 3D space, and the high cost of fine-grained voxel-level annotation.
To address these challenges, the paper categorizes existing methods into three main lines: feature enhancement, deployment-friendly, and label-efficient approaches.
Feature enhancement methods aim to improve occupancy prediction by learning from Bird's Eye View (BEV), Tri-Perspective View (TPV), and 3D voxel representations. BEV-based methods exploit the advantages of BEV features, which are relatively insensitive to occlusion and encode some depth and geometric information. TPV-based methods model the 3D environment with three orthogonal projection planes, further enhancing representational capacity. Voxel-based methods operate directly on 3D voxel representations to capture complete spatial information.
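To make the TPV idea concrete, here is a minimal sketch (not any specific paper's implementation; grid sizes, channel count, and sum aggregation are illustrative assumptions): features are stored on three orthogonal planes rather than a full voxel grid, and a voxel's feature is recovered by aggregating the three plane features at its projected coordinates.

```python
import numpy as np

# Illustrative grid dimensions and channel count (assumptions).
H, W, Z, C = 16, 16, 8, 32
tpv_hw = np.random.rand(H, W, C)    # top-down (BEV-like) plane
tpv_hz = np.random.rand(H, Z, C)    # side plane
tpv_wz = np.random.rand(W, Z, C)    # front plane

def voxel_feature(x, y, z):
    """Reconstruct one voxel's feature by summing its three plane projections
    (summation is one common aggregation choice)."""
    return tpv_hw[x, y] + tpv_hz[x, z] + tpv_wz[y, z]

feat = voxel_feature(3, 5, 2)
print(feat.shape)  # (32,)
```

The storage saving is the point: the three planes hold (H·W + H·Z + W·Z)·C values instead of the full H·W·Z·C voxel volume.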
Deployment-friendly methods focus on significantly reducing resource consumption while preserving performance by designing concise and efficient network architectures, including perspective decomposition and coarse-to-fine paradigms.
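The coarse-to-fine paradigm can be sketched as follows (a hedged illustration, not a specific paper's method; grid sizes, the upsampling factor, and the placeholder refinement are assumptions): occupancy is first predicted on a coarse grid, and fine-grained computation is spent only on coarse cells flagged as occupied.

```python
import numpy as np

rng = np.random.default_rng(0)
coarse = rng.random((8, 8, 4)) > 0.7   # coarse occupancy prediction (boolean)
FACTOR = 4                             # each coarse cell covers FACTOR^3 fine voxels

fine = np.zeros((8 * FACTOR, 8 * FACTOR, 4 * FACTOR), dtype=bool)
for x, y, z in np.argwhere(coarse):    # refine only the occupied coarse cells
    # Placeholder "refinement": a real model would run a small network here
    # to predict fine occupancy and semantics inside this block.
    fine[x*FACTOR:(x+1)*FACTOR,
         y*FACTOR:(y+1)*FACTOR,
         z*FACTOR:(z+1)*FACTOR] = True

# Fraction of the scene that actually required fine-grained work:
print(coarse.mean())
```

Because most driving scenes are dominated by free space, skipping refinement for empty coarse cells is where the savings come from.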
Label-efficient methods aim to achieve satisfactory performance even with insufficient or completely absent annotations, including annotation-free and LiDAR-free approaches.
Finally, the paper outlines promising future directions for vision-based 3D occupancy prediction from the perspectives of data, methodology, and task.
Stats
"The goal of vision-based 3D occupancy prediction is to achieve detailed perception and comprehension of the 3D scene solely from image inputs."
"3D occupancy prediction typically requires representing the environmental space using 3D voxel features, which inevitably involves operations like 3D convolutions for feature extraction, substantially increasing computational and memory overhead and hindering practical deployment."
"Achieving fine-grained semantic annotation for each voxel is both time-consuming and costly, posing a bottleneck for this task."
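The computational-overhead claim above is easy to verify with back-of-envelope arithmetic (the grid size and channel count below are illustrative assumptions, roughly in the range of the 200×200×16 grids used by common occupancy benchmarks):

```python
# Memory of a dense 3D voxel feature volume vs. a collapsed BEV plane,
# in float32. Numbers are illustrative assumptions, not from the paper.
X, Y, Z, C = 200, 200, 16, 128
BYTES_PER_FLOAT = 4

voxel_mb = X * Y * Z * C * BYTES_PER_FLOAT / 2**20   # full 3D feature volume
bev_mb   = X * Y * C * BYTES_PER_FLOAT / 2**20       # single BEV plane

# The 3D volume is exactly Z times larger than the BEV plane.
print(f"voxel: {voxel_mb:.1f} MiB, bev: {bev_mb:.1f} MiB")
```

A single feature volume at this size already costs hundreds of MiB before any 3D convolution activations are counted, which is why deployment-friendly designs avoid operating on the full grid.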
Quotes
"Vision-based 3D occupancy prediction, which predicts the spatial occupancy status and semantic categories of 3D voxel grids around the autonomous vehicle from image inputs, is a promising solution for providing fine-grained representation and robust detection for undefined long-tail obstacles in 3D space."
"Occupancy representation originated from the field of robotics, where 3D space is divided into voxel units for binary prediction of whether voxels are occupied by objects, enabling effective collision avoidance."
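The robotics-style binary occupancy grid described in this quote can be illustrated in a few lines (grid shape, obstacle placement, and the `collides` helper are illustrative assumptions): space is divided into voxels, each marked occupied or free, and a path is safe only if it touches no occupied voxel.

```python
import numpy as np

# False = free, True = occupied (binary occupancy, no semantics).
grid = np.zeros((10, 10, 5), dtype=bool)
grid[4:6, 4:6, 0:2] = True             # an obstacle occupying a small block

def collides(path):
    """Check whether any waypoint on a voxelized path hits an occupied voxel."""
    return any(grid[x, y, z] for x, y, z in path)

print(collides([(0, 0, 0), (4, 4, 1)]))  # True: second waypoint is inside the obstacle
print(collides([(0, 0, 0), (9, 9, 4)]))  # False: both waypoints are free
```

Vision-based 3D occupancy prediction extends this binary formulation by additionally assigning a semantic category to each occupied voxel.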