The paper introduces a novel task called "reasoning part segmentation" for 3D objects, which involves generating a segmentation mask based on complex and implicit textual queries about specific parts of a 3D object. To facilitate this task, the authors present a large 3D dataset called RPSeg3D, comprising over 60k instructions paired with corresponding ground-truth part segmentation annotations.
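To make the data format concrete, here is a minimal sketch of what one instruction/annotation pair in such a dataset could look like. The class name, field names, and example values are illustrative assumptions, not the released RPSeg3D schema:

```python
# A minimal sketch (field names are assumptions, not the released RPSeg3D
# schema) of one instruction/annotation pair: an implicit textual query about
# a part, paired with a per-point ground-truth mask over the object's points.
from dataclasses import dataclass
import numpy as np

@dataclass
class RPSegSample:
    object_id: str          # which 3D object the query refers to
    instruction: str        # implicit query, e.g. about function rather than part name
    part_mask: np.ndarray   # boolean (N,) mask: True for points in the target part

sample = RPSegSample(
    object_id="chair_0042",
    instruction="Segment the part you would hold to drag this object across the floor.",
    part_mask=np.zeros(2048, dtype=bool),  # placeholder ground-truth mask
)
```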
The authors propose PARIS3D, a multimodal large language model that can segment parts of 3D objects based on implicit textual queries, generate natural language explanations, and reason about 3D object properties and concepts. PARIS3D takes a 3D point cloud as input and renders it into multiple 2D images, which are then processed by a vision backbone and a multimodal language model. The model outputs a segmentation mask and a text response explaining its reasoning.
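The point-cloud-to-multi-view step can be illustrated with a short, self-contained sketch. The orthographic projection, number of views, and image size below are illustrative assumptions rather than the authors' implementation; the rendered views stand in for what a PARIS3D-style pipeline would pass to the vision backbone and multimodal language model:

```python
# A minimal sketch (not the authors' code) of rendering a point cloud into
# multiple 2D views. Rotation angles, resolution, and the point-splatting
# scheme are illustrative assumptions.
import numpy as np

def render_views(points: np.ndarray, n_views: int = 4, size: int = 64) -> np.ndarray:
    """Orthographically project an (N, 3) point cloud into n_views depth images."""
    # Normalize the cloud into the cube [-1, 1]^3 centered at the origin.
    pts = points - points.mean(axis=0)
    pts /= np.abs(pts).max() + 1e-8

    views = np.zeros((n_views, size, size), dtype=np.float32)
    for k in range(n_views):
        theta = 2 * np.pi * k / n_views  # rotate the camera around the vertical axis
        c, s = np.cos(theta), np.sin(theta)
        rot = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
        p = pts @ rot.T
        # Map x/y to pixel coordinates; clip because rotation can push
        # coordinates slightly outside [-1, 1].
        u = np.clip(((p[:, 0] + 1) / 2 * (size - 1)).astype(int), 0, size - 1)
        v = np.clip(((p[:, 1] + 1) / 2 * (size - 1)).astype(int), 0, size - 1)
        depth = p[:, 2]  # camera looks along -z, so larger z is nearer
        order = np.argsort(depth)  # draw far points first; near points overwrite
        views[k, v[order], u[order]] = (depth[order] + 1) / 2  # normalized depth
    return views

cloud = np.random.rand(2048, 3)  # stand-in for a real 3D object
images = render_views(cloud)
print(images.shape)  # (4, 64, 64) -- one rendered view per virtual camera
```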
Experiments show that PARIS3D achieves competitive performance compared to models that use explicit queries, with the additional abilities to identify part concepts, reason about them, and complement them with world knowledge. The authors also demonstrate the generalizability of PARIS3D to real-world point cloud data.
Key insights distilled from:
by Amrin Kareem... on arxiv.org, 04-08-2024
https://arxiv.org/pdf/2404.03836.pdf