Core Concept
This paper presents the first approach to 3D open-vocabulary panoptic segmentation in autonomous driving, leveraging large vision-language models and introducing novel loss functions for effective learning.
Summary
The paper addresses the problem of 3D open-vocabulary panoptic segmentation, which aims to predict both semantic and instance annotations for 3D points in a scene, including for unseen object categories.
The key highlights are:
- Existing 3D panoptic segmentation models can only predict for a closed-set of object categories, failing to generalize to unseen categories.
- The authors propose a novel architecture that fuses learnable LiDAR features with frozen CLIP vision features, enabling predictions for both base and novel classes (see the fusion sketch after this list).
- Two novel loss functions are introduced: an object-level distillation loss and a voxel-level distillation loss, which leverage the frozen CLIP model to improve classification performance on novel classes (see the loss sketch after this list).
- Experiments on the nuScenes and SemanticKITTI datasets show that the proposed method outperforms a strong baseline by a large margin.
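To make the fusion idea concrete, here is a minimal sketch of how learnable LiDAR features could be combined with frozen CLIP vision features and classified against CLIP text embeddings of class names. The dimensions, layer choices, and the class name `OpenVocabPointClassifier` are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OpenVocabPointClassifier(nn.Module):
    """Sketch: fuse learnable LiDAR point features with frozen CLIP vision
    features, then classify by matching against CLIP text embeddings."""

    def __init__(self, lidar_dim=128, clip_dim=512):
        super().__init__()
        # Only the LiDAR branch is trainable; CLIP features arrive precomputed.
        self.lidar_proj = nn.Linear(lidar_dim, clip_dim)

    def forward(self, lidar_feats, clip_vision_feats, clip_text_embeds):
        # lidar_feats:        (N, lidar_dim)  learnable per-point features
        # clip_vision_feats:  (N, clip_dim)   frozen CLIP features lifted to points
        # clip_text_embeds:   (C, clip_dim)   frozen CLIP text embeddings of class names
        fused = self.lidar_proj(lidar_feats) + clip_vision_feats.detach()
        fused = F.normalize(fused, dim=-1)
        text = F.normalize(clip_text_embeds, dim=-1)
        # Cosine-similarity logits over whatever class set the text embeddings define.
        return fused @ text.t()
```

Because the class set is defined only by the text embeddings passed in, the same head can score both base and novel categories.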
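The two distillation losses can be pictured as pulling predicted embeddings toward frozen CLIP vision embeddings at different granularities. The sketch below shows a generic cosine-distance distillation term; the pooling comments and the helper name `clip_distillation_loss` are assumptions, and the paper's exact object-level and voxel-level formulations may differ.

```python
import torch
import torch.nn.functional as F

def clip_distillation_loss(pred_embeds, clip_embeds):
    """Generic cosine-distance distillation: align predicted embeddings
    with frozen CLIP vision embeddings (sketch only)."""
    pred = F.normalize(pred_embeds, dim=-1)
    target = F.normalize(clip_embeds.detach(), dim=-1)  # CLIP side stays frozen
    return (1.0 - (pred * target).sum(dim=-1)).mean()

# Object-level: pool point embeddings within each predicted instance mask,
# then distill the pooled embedding against that instance's CLIP feature.
# Voxel-level: apply the same loss densely to per-voxel embeddings.
```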
Statistics
The nuScenes dataset consists of 1000 run segments with 16 semantic classes (10 things, 6 stuff).
The SemanticKITTI dataset has 19 semantic classes (8 things, 11 stuff).
Quotes
"3D panoptic segmentation is a crucial task in computer vision with many real-world applications, most notably in autonomous driving."
"Existing models only predict panoptic segmentation results for a closed-set of objects. They fail to create predictions for the majority of unseen object categories in the scene, hindering the application of these algorithms to real-world scenarios, especially for autonomous driving."