Key Concepts
SHAPELLM is a 3D multimodal Large Language Model designed for embodied interaction, achieving state-of-the-art performance in 3D geometry understanding and language-unified 3D interaction tasks.
Abstract
SHAPELLM is a 3D multimodal Large Language Model designed for embodied interaction, focusing on 3D object understanding and interaction tasks.
The model uses RECON++, an improved 3D point cloud encoder, as its input encoder (a schematic sketch of the encoder-to-LLM pipeline follows the abstract).
SHAPELLM demonstrates superior performance in various tasks, including 3D captioning, 3D VQA, and embodied visual grounding.
The model is trained on newly constructed instruction-following data and evaluated on the 3D MM-Vet benchmark.
SHAPELLM sets a new state of the art in representation transfer on downstream fine-tuned and zero-shot 3D object recognition tasks.
The model shows robust capabilities in knowledge representation, reasoning, and instruction-following dialogue.
SHAPELLM exhibits strong potential for real-world applicability and generalization to unseen objects.
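To make the architecture described above more concrete, here is a minimal sketch of how a 3D point cloud encoder could feed tokens into an LLM's embedding space. This is an illustration under stated assumptions, not ShapeLLM's actual implementation: the class names (PointCloudEncoder, Projector), dimensions, and the pooling scheme are all hypothetical.

```python
# Hypothetical sketch: point cloud -> 3D tokens -> projection into LLM embedding space.
# None of these names or dimensions come from the ShapeLLM codebase.
import torch
import torch.nn as nn


class PointCloudEncoder(nn.Module):
    """Stand-in for a RECON++-style 3D encoder: maps (B, N, 3) points to a fixed token grid."""

    def __init__(self, num_tokens: int = 64, embed_dim: int = 384):
        super().__init__()
        self.num_tokens = num_tokens
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 128), nn.GELU(), nn.Linear(128, embed_dim)
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3) -> per-point features, then max-pooled into num_tokens groups
        feats = self.point_mlp(points)                       # (B, N, D)
        B, N, D = feats.shape
        feats = feats[:, : (N // self.num_tokens) * self.num_tokens]
        feats = feats.reshape(B, self.num_tokens, -1, D).max(dim=2).values  # (B, T, D)
        return feats


class Projector(nn.Module):
    """Projects 3D tokens into the LLM embedding space (a LLaVA-style connector)."""

    def __init__(self, in_dim: int = 384, llm_dim: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(tokens)


# Usage: the projected 3D tokens would be prepended to the text embeddings before the LLM.
points = torch.randn(2, 1024, 3)             # two point clouds, 1024 points each
pc_tokens = PointCloudEncoder()(points)      # (2, 64, 384)
llm_tokens = Projector()(pc_tokens)          # (2, 64, 1024), ready to concatenate with text tokens
print(llm_tokens.shape)
```

The design choice mirrored here is the common two-stage pattern in multimodal LLMs: a modality-specific encoder produces a fixed number of tokens, and a lightweight projector aligns them with the language model's embedding dimension.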
Statistics
RECON++ achieves 95.25% accuracy on the ScanObjectNN benchmark.
SHAPELLM-13B surpasses previous best records by +5.1% on the 3D MM-Vet benchmark.
Quotes
"SHAPELLM demonstrates superior performance in various tasks, including 3D captioning, 3D VQA, and embodied visual grounding."
"The model shows robust capabilities in knowledge representation, reasoning, and instruction-following dialogue."