The authors introduce SCENEVERSE, a million-scale 3D vision-language dataset, to address the challenges of grounding language in 3D scenes, and propose GPS, a pre-training framework that achieves state-of-the-art performance on existing 3D visual grounding benchmarks.
3D vision-language grounding is crucial for embodied agents; SCENEVERSE addresses it through data scaling and GPS pre-training.
SCENEVERSE, a million-scale dataset, and the GPS pre-training framework advance 3D vision-language learning, achieving state-of-the-art results on 3D visual grounding benchmarks.