O2V-mapping enables online construction of dense open-vocabulary scenes by grounding language embeddings from text-image models into a voxel-based neural implicit representation, addressing semantic ambiguity and multi-view inconsistency.
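The core idea of grounding text-image features into voxels can be sketched as follows: per-pixel language embeddings are back-projected into 3D, averaged per voxel, and the resulting voxel field is queried with a text embedding via cosine similarity. This is a minimal illustrative sketch, not O2V-mapping's actual pipeline; all function and variable names here are hypothetical, and the real method uses a neural implicit representation rather than a plain averaging grid.

```python
import numpy as np

def fuse_embeddings_into_voxels(points, embeddings, voxel_size=0.5):
    """Average language embeddings of 3D points that fall into the same voxel.

    points:     (N, 3) back-projected pixel locations.
    embeddings: (N, D) per-pixel text-image (e.g. CLIP-like) features.
    Returns a dict mapping integer voxel coordinates to a unit-norm embedding.
    """
    keys = np.floor(points / voxel_size).astype(np.int64)
    voxels = {}
    for key, emb in zip(map(tuple, keys), embeddings):
        acc, n = voxels.get(key, (np.zeros(emb.shape), 0))
        voxels[key] = (acc + emb, n + 1)
    out = {}
    for key, (acc, n) in voxels.items():
        mean = acc / n
        out[key] = mean / (np.linalg.norm(mean) + 1e-8)  # unit norm for cosine
    return out

def query_voxels(voxel_embs, text_emb, threshold=0.5):
    """Return voxels whose fused embedding matches a text query embedding."""
    text_emb = text_emb / (np.linalg.norm(text_emb) + 1e-8)
    return [k for k, v in voxel_embs.items() if float(v @ text_emb) >= threshold]

# Toy demo: two nearby points share a voxel; a third lands in another voxel.
pts = np.array([[0.1, 0.1, 0.1], [0.2, 0.2, 0.2], [1.6, 0.0, 0.0]])
embs = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
vox = fuse_embeddings_into_voxels(pts, embs, voxel_size=0.5)
hits = query_voxels(vox, np.array([1.0, 0.0]), threshold=0.9)  # -> [(0, 0, 0)]
```

Averaging multiple views into one voxel embedding is what mitigates per-frame noise; the open-vocabulary query is just a similarity test against any encoded text prompt.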
SOLE is a semantic- and geometric-aware visual-language learning framework that directly segments 3D objects from point clouds with strong generalizability by leveraging multimodal information and cross-modal associations.
Open3DIS is a novel method that combines 3D instance proposals from a class-agnostic 3D segmentation network with 2D instance masks from a 2D open-vocabulary segmentation model to achieve high-quality 3D instance segmentation for both seen and unseen object classes.
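The proposal-mask association underlying this combination can be sketched as follows: each class-agnostic 3D proposal (a set of point indices) is matched against 2D open-vocabulary masks lifted to point indices via camera projection, and inherits the label of the best-overlapping mask. This is a simplified sketch of the association idea under assumed point-set IoU matching, not Open3DIS's full algorithm; the names are hypothetical.

```python
def point_iou(a, b):
    """IoU between two sets of point indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def assign_labels(proposals_3d, lifted_masks_2d, iou_thresh=0.5):
    """Label each class-agnostic 3D proposal with the best-overlapping
    lifted 2D open-vocabulary mask; unmatched proposals stay unlabeled.

    proposals_3d:    list of point-index lists (one per 3D proposal).
    lifted_masks_2d: list of (label, point-index list) pairs from 2D masks.
    """
    labels = []
    for prop in proposals_3d:
        best_label, best_iou = None, iou_thresh
        for label, mask_pts in lifted_masks_2d:
            iou = point_iou(prop, mask_pts)
            if iou > best_iou:
                best_label, best_iou = label, iou
        labels.append(best_label)
    return labels

# Toy demo: the first proposal overlaps a "chair" mask; the second matches nothing.
proposals = [[0, 1, 2, 3], [10, 11, 12]]
masks = [("chair", [0, 1, 2, 3, 4]), ("table", [20, 21])]
labels = assign_labels(proposals, masks)  # -> ["chair", None]
```

Because the 3D proposals are class-agnostic, any label vocabulary supported by the 2D model transfers to 3D, which is what lets the method cover unseen object classes.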