The paper proposes MOSE (MOnocular 3D object detection with Scene cuEs), a framework for 3D object detection from roadside cameras. The key insights are:
Roadside cameras have unique characteristics, such as fixed installation and frame-invariant scene-specific features, which can be leveraged to improve 3D object detection.
The authors introduce "scene cues": the relative height between the real road surface and the virtual ground plane, which is crucial for accurate object localization. A scene cue bank aggregates these cues across multiple frames of the same scene.
A transformer-based decoder lifts the aggregated scene cues and 3D position embeddings to predict 3D object locations, which improves generalization to heterologous scenes.
The paper also introduces a scene-based data augmentation strategy to improve the model's robustness to varying camera parameters.
Extensive experiments on the Rope3D and DAIR-V2X datasets show that MOSE achieves state-of-the-art performance and significantly outperforms existing methods, especially in detection ability and localization precision on heterologous scenes.
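One way to picture the scene cue bank is as a per-scene store that is refreshed frame by frame. The sketch below is illustrative only and is not the authors' implementation: the class name `SceneCueBank`, the `momentum` parameter, the use of an exponential moving average, and the representation of a frame's cues as a flat list of relative heights are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SceneCueBank:
    """Toy per-scene store of height cues (hypothetical, not the paper's code)."""

    # Assumed smoothing factor: weight given to the cues already stored.
    momentum: float = 0.9
    _bank: Dict[str, List[float]] = field(default_factory=dict)

    def update(self, scene_id: str, frame_cues: List[float]) -> List[float]:
        """Blend one frame's cues into the stored cues for that scene."""
        stored = self._bank.get(scene_id)
        if stored is None:
            # First frame seen for this scene: store its cues unchanged.
            self._bank[scene_id] = list(frame_cues)
        else:
            if len(stored) != len(frame_cues):
                raise ValueError("cue vectors for one scene must have equal length")
            # Exponential moving average over frames of the same scene.
            self._bank[scene_id] = [
                self.momentum * old + (1.0 - self.momentum) * new
                for old, new in zip(stored, frame_cues)
            ]
        return self._bank[scene_id]

    def lookup(self, scene_id: str) -> Optional[List[float]]:
        """Return the aggregated cues for a scene, or None if it is unseen."""
        return self._bank.get(scene_id)


bank = SceneCueBank(momentum=0.5)
bank.update("intersection_a", [1.0, 2.0])
aggregated = bank.update("intersection_a", [3.0, 4.0])
```

Because roadside cameras are fixed, cues from different frames of one scene describe the same road geometry, which is what makes this kind of per-scene aggregation plausible. A real system would store feature maps rather than short lists of floats.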
Key insights extracted from
by Xiahan Chen,... at arxiv.org 04-09-2024
https://arxiv.org/pdf/2404.05280.pdf