Core Concepts
A computationally efficient method for open-set semantic localization and mapping that utilizes self-supervised vision transformer features (DINO) to augment geometric correspondence matching at the object level.
Summary
The paper presents a novel system for tightly-coupled open-set semantic SLAM in sparse environments. It takes an off-the-shelf image encoding network (DINO) and post-processes its output to obtain instance-level object segmentations. It proposes a lightweight encoding that represents each object as a single vector, and demonstrates that this encoding works with several data association methods in a factor graph-based SLAM framework.
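To make the single-vector object encoding concrete, here is a minimal sketch. The summary does not say how the vector is computed, so this assumes mean pooling of per-patch features inside an object mask, followed by L2 normalization, with cosine similarity as the comparison. The function names and the toy 2-D features are hypothetical; real DINO features have hundreds of dimensions.

```python
import math

def pool_object_descriptor(patch_features, mask):
    """Average the feature vectors of the patches selected by `mask`,
    then L2-normalize the result (assumed pooling scheme, not the paper's)."""
    selected = [f for f, keep in zip(patch_features, mask) if keep]
    if not selected:
        raise ValueError("mask selects no patches")
    dim = len(selected[0])
    mean = [sum(f[i] for f in selected) / len(selected) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    if norm == 0.0:
        raise ValueError("pooled feature has zero norm")
    return [x / norm for x in mean]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Storing one vector per object, rather than one per pixel or patch, is what keeps such a map sparse: comparing two objects costs a single dot product.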
The key highlights and insights are:
- A lightweight (sparse) open-set object representation using DINO features.
- A tightly-coupled open-set semantic SLAM system that uses the proposed object representation along with geometric information to improve the vehicle's positioning accuracy and vice-versa.
- Experimental results on collected and public datasets demonstrating that the proposed method achieves more accurate and efficient data association and localization than dense methods, geometric-only methods, and closed-set methods, while also producing more complete maps than closed-set methods.
- The method is computationally more efficient than competing dense methods and achieves high localization accuracy by reasoning at the object level rather than at the pixel or dense feature level.
- The maps produced by the method are more semantically consistent than those of dense methods, and more complete than those of closed-set methods, with higher localization accuracy than the latter.
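The idea of using object encodings together with geometric information for data association can be illustrated with a simple greedy rule: consider only landmarks within a distance gate of the detection, then pick the one with the most similar descriptor. This is one possible scheme under stated assumptions, not the paper's algorithm; `associate`, `gate_radius`, and `min_similarity` are hypothetical names and values, and descriptors are assumed to be unit-norm.

```python
import math

def associate(detection_pos, detection_desc, landmarks,
              gate_radius=2.0, min_similarity=0.7):
    """Return the id of the best-matching landmark, or None if no landmark
    passes both the geometric gate and the similarity threshold.

    `landmarks` maps id -> (position, unit-norm descriptor).
    """
    best_id, best_sim = None, min_similarity
    for landmark_id, (pos, desc) in landmarks.items():
        # Geometric gate: skip landmarks too far from the detection.
        if math.dist(detection_pos, pos) > gate_radius:
            continue
        # Dot product equals cosine similarity for unit-norm descriptors.
        sim = sum(a * b for a, b in zip(detection_desc, desc))
        if sim >= best_sim:
            best_id, best_sim = landmark_id, sim
    return best_id
```

A detection that returns `None` would typically be added to the map as a new landmark. In a factor graph setting, each accepted match would become a measurement factor between the current pose and the matched landmark.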
Statistics
The system takes noisy odometry as input: Gaussian noise is added to the relative odometry measurement at each keyframe.
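The noise model can be sketched as follows, assuming planar motion where each relative odometry measurement is a tuple (dx, dy, dtheta) in the previous keyframe's frame. The summary does not give the noise magnitudes or the pose parameterization, so the standard deviations, function names, and the planar simplification here are all hypothetical.

```python
import math
import random

def add_odometry_noise(rel_motions, sigmas, seed=0):
    """Add independent zero-mean Gaussian noise to each relative motion.

    `rel_motions` is a list of (dx, dy, dtheta) tuples, one per keyframe;
    `sigmas` holds the (assumed) standard deviation for each component.
    """
    rng = random.Random(seed)
    return [tuple(v + rng.gauss(0.0, s) for v, s in zip(rel, sigmas))
            for rel in rel_motions]

def compose(pose, rel):
    """Apply a relative motion, expressed in the pose's frame, to a planar pose."""
    x, y, theta = pose
    dx, dy, dtheta = rel
    return (x + math.cos(theta) * dx - math.sin(theta) * dy,
            y + math.sin(theta) * dx + math.cos(theta) * dy,
            theta + dtheta)

def integrate(rel_motions, start=(0.0, 0.0, 0.0)):
    """Chain relative motions into a dead-reckoned trajectory."""
    poses = [start]
    for rel in rel_motions:
        poses.append(compose(poses[-1], rel))
    return poses
```

Because the noise enters every relative measurement, the dead-reckoned trajectory from `integrate` drifts further from the true one as keyframes accumulate; landmark associations are what allow a SLAM back end to correct that drift.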