Core Concepts
A novel graph-based approach that leverages view consensus to effectively merge 2D masks into 3D instances, achieving state-of-the-art performance in open-vocabulary 3D instance segmentation.
Abstract
The paper proposes a novel approach for open-vocabulary 3D instance segmentation, which tackles the challenge of segmenting 3D instances without predefined categories.
The key contributions are:
A graph-based methodology to merge 2D masks into 3D instances, built on a global "view consensus rate" metric that assesses the relationship between 2D masks.
An efficient algorithm to compute the view consensus rate, which measures the proportion of frames supporting the merging of two 2D masks.
An iterative graph clustering process that prioritizes merging mask pairs with high view consensus, yielding the final 3D instance proposals.
A feature fusion mechanism that aggregates open-vocabulary semantics from the associated 2D masks for each 3D instance.
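The first three contributions can be illustrated with a simplified sketch. Here, each 2D mask is represented as a set of 3D point indices it covers, a frame "observes" a mask pair when it sees points from both masks, and a frame "supports" merging when a single mask in that frame contains all the pair's visible points. The consensus threshold, the data layout, and the greedy merge order are illustrative assumptions, not the paper's exact formulation.

```python
from itertools import combinations

def view_consensus_rate(mask_a, mask_b, frames):
    """Fraction of observing frames that support merging two masks.

    mask_a, mask_b: sets of 3D point indices covered by each mask.
    frames: list of frames, each a list of per-frame 2D masks
            (each again a set of visible 3D point indices).
    Illustrative simplification of the paper's criterion.
    """
    pair_points = mask_a | mask_b
    observing = supporting = 0
    for frame_masks in frames:
        visible = set().union(*frame_masks) if frame_masks else set()
        # The frame observes the pair only if it sees part of BOTH masks.
        if not (mask_a & visible) or not (mask_b & visible):
            continue
        observing += 1
        # The frame supports the merge if one of its masks contains
        # all visible points of the pair.
        visible_pair = pair_points & visible
        if any(visible_pair <= m for m in frame_masks):
            supporting += 1
    return supporting / observing if observing else 0.0

def cluster_masks(masks, frames, threshold=0.9):
    """Iteratively merge the mask pair with the highest consensus rate
    until no pair exceeds the threshold (greedy sketch, O(n^3))."""
    clusters = [set(m) for m in masks]
    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            rate = view_consensus_rate(clusters[i], clusters[j], frames)
            if rate >= threshold and (best is None or rate > best[0]):
                best = (rate, i, j)
        if best is None:
            return clusters  # remaining clusters are the 3D instance proposals
        _, i, j = best
        clusters[i] |= clusters[j]
        del clusters[j]
```

For example, two masks that every observing frame sees inside one larger mask get a consensus rate of 1.0 and are merged first, while masks never jointly contained stay separate.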
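The feature fusion step can likewise be sketched. Assuming each 2D mask carries an open-vocabulary feature vector (e.g. from a CLIP-style encoder), a minimal aggregation averages the features of the masks associated with each 3D instance and L2-normalizes the result; the averaging scheme here is an assumption, as the paper may weight masks differently.

```python
import numpy as np

def fuse_instance_features(instance_to_masks, mask_features):
    """Aggregate open-vocabulary semantics per 3D instance.

    instance_to_masks: dict mapping instance id -> list of 2D mask ids.
    mask_features: dict mapping mask id -> feature vector (np.ndarray).
    Returns a dict of L2-normalized fused features, one per instance.
    Illustrative sketch: plain averaging, no per-mask weighting.
    """
    fused = {}
    for inst, mask_ids in instance_to_masks.items():
        feats = np.stack([mask_features[m] for m in mask_ids])
        mean = feats.mean(axis=0)
        fused[inst] = mean / (np.linalg.norm(mean) + 1e-8)
    return fused
```

The fused vector can then be compared against text-query embeddings (e.g. by cosine similarity) to answer open-vocabulary queries per instance.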
The proposed method is extensively evaluated on public benchmarks, including ScanNet++, ScanNet200, and Matterport3D, demonstrating state-of-the-art performance in open-vocabulary 3D instance segmentation, particularly in segmenting fine-grained objects.
Example Queries
Example open-vocabulary queries the method aims to handle include "Green plant on the marble table" and "White clothes on the sofa chair".
The method is evaluated on publicly available datasets including ScanNet++, ScanNet200, and Matterport3D.