MSQA: A Large-Scale Multi-Modal Dataset and Benchmark Tasks for Situated Reasoning in 3D Scenes
This paper introduces MSQA, a large-scale dataset with interleaved multi-modal inputs for situated reasoning in 3D scenes, and proposes two benchmark tasks, MSQA and MSNN, to evaluate models' capabilities in situated reasoning and navigation.