Core Concepts
This work introduces the new task of outdoor 3D dense captioning: localizing and describing all objects in a 3D outdoor scene in natural language. To facilitate research in this area, the authors propose the TOD3Cap dataset, the largest to date for 3D dense captioning in outdoor scenes, and develop the TOD3Cap network, a transformer-based architecture that addresses the task's unique challenges.
Abstract
The authors introduce the task of outdoor 3D dense captioning, which involves localizing and describing all objects in a 3D outdoor scene using natural language. This task poses unique challenges compared to indoor 3D dense captioning, such as dynamic scenes, sparse LiDAR point clouds, fixed camera perspectives, and larger scene areas.
To support this task, the authors propose the TOD3Cap dataset, which contains 2.3M descriptions of 64.3K outdoor objects drawn from 850 scenes in the nuScenes dataset, making it the largest dataset for 3D dense captioning in outdoor scenes.
The authors also introduce the TOD3Cap network, a transformer-based architecture that leverages a bird's-eye-view (BEV) representation to generate object box proposals and integrates a Relation Q-Former with LLaMA-Adapter to generate rich captions for these objects. Experiments show that the TOD3Cap network outperforms adapted state-of-the-art indoor methods by a significant margin (+9.6 CIDEr@0.5IoU).
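At a high level, the network's output is a set of box-caption pairs: one 7-DoF bounding box (the nuScenes convention: center, size, yaw) plus one natural-language description per detected object. A minimal sketch of this output format and the detect-then-describe pairing step, with all names (`BoxCaptionPair`, `dense_caption_scene`, `describe`) being hypothetical illustrations rather than the authors' actual API:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class BoxCaptionPair:
    """One output of outdoor 3D dense captioning (hypothetical container)."""
    center: Vec3   # (x, y, z) box center in metres
    size: Vec3     # (width, length, height) in metres
    yaw: float     # heading angle in radians
    caption: str   # natural-language description of the object

def dense_caption_scene(
    bev_proposals: List[Tuple[Vec3, Vec3, float]],
    describe: Callable[[Vec3, Vec3, float], str],
) -> List[BoxCaptionPair]:
    # Detect-then-describe: every box proposal from the BEV detector
    # is paired with a caption from the language head.
    return [
        BoxCaptionPair(center=c, size=s, yaw=y, caption=describe(c, s, y))
        for (c, s, y) in bev_proposals
    ]

# Toy usage with a stub captioner standing in for the language model:
proposals = [((10.0, 2.0, 0.5), (1.8, 4.5, 1.6), 0.0)]
pairs = dense_caption_scene(proposals, lambda c, s, y: "a parked car ahead")
```

The design choice illustrated here is simply that localization and description are produced jointly, one caption per box, rather than a single scene-level caption.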
The key highlights of the TOD3Cap dataset and network are:
- Introduction of the outdoor 3D dense captioning task and its unique challenges.
- Proposal of the TOD3Cap dataset, the largest to date for 3D dense captioning in outdoor scenes.
- Development of the TOD3Cap network, a transformer-based architecture that effectively addresses the outdoor 3D dense captioning task.
- Significant performance improvement over adapted state-of-the-art indoor methods.
Stats
The TOD3Cap dataset contains 2.3M descriptions of 64.3K outdoor objects from 850 scenes in the nuScenes dataset.
Quotes
"We introduce the task of 3D dense captioning in outdoor scenes (right). Given point clouds (right middle) and multi-view RGB inputs (right top), we predict box-caption pairs of all objects in a 3D outdoor scene."
"To this end, we introduce the TOD3Cap dataset, the largest one to our knowledge for 3D dense captioning in outdoor scenes, which contains 2.3M descriptions of 64.3K outdoor objects from 850 scenes in nuScenes."