The authors propose a weakly supervised monocular 3D detection framework that derives depth information solely from single-view images, achieving state-of-the-art performance without requiring additional training data such as LiDAR point clouds or multi-view images.
The authors propose a framework for jointly training monocular 3D object detection models on multiple datasets, improving generalization and boosting performance on new datasets that provide only 2D labels.
The authors propose VFMM3D, a novel framework that integrates the Segment Anything Model (SAM) and the Depth Anything Model (DAM) to generate high-quality pseudo-LiDAR data enriched with semantic information and accurate depth, enabling state-of-the-art monocular 3D object detection.
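As a rough illustration of the pseudo-LiDAR idea underlying VFMM3D (a minimal sketch, not the authors' exact pipeline), the snippet below back-projects a predicted depth map into a 3D point cloud with pinhole camera intrinsics and tags each point with a foreground flag; the depth map, mask, and intrinsics are assumed inputs standing in for DAM and SAM outputs, and the function name is hypothetical.

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy, fg_mask=None):
    """Back-project a depth map (H, W) into an (N, 4) pseudo-LiDAR array.

    depth   : per-pixel depth in meters (e.g., predicted by a depth model such as DAM)
    fx, fy  : focal lengths in pixels; cx, cy : principal point
    fg_mask : optional boolean (H, W) foreground mask (e.g., from SAM), stored
              as a fourth channel so a downstream detector can use the semantics
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinate grid
    z = depth
    x = (u - cx) * z / fx                            # pinhole back-projection
    y = (v - cy) * z / fy
    sem = fg_mask.astype(np.float32) if fg_mask is not None else np.zeros_like(z)
    points = np.stack([x, y, z, sem], axis=-1).reshape(-1, 4)
    return points[points[:, 2] > 0]                  # drop invalid (non-positive) depths

# Toy usage with random inputs standing in for a DAM depth map and a SAM mask.
depth = np.random.uniform(1.0, 50.0, size=(96, 320)).astype(np.float32)
mask = np.zeros((96, 320), dtype=bool)
mask[30:60, 100:200] = True
pc = depth_to_pseudo_lidar(depth, fx=700.0, fy=700.0, cx=160.0, cy=48.0, fg_mask=mask)
print(pc.shape)  # (N, 4): x, y, z, semantic flag
```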