Existing 3D vision-language models struggle with natural language variations, highlighting a need for improved robustness.
Existing 3D-VL models struggle with natural language variations, highlighting the need for improved language robustness.