Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding
Core Concepts
State-of-the-art models struggle with Spacewalk-18 tasks, highlighting the need for improved video-language models.
Summary
The Spacewalk-18 benchmark introduces two tasks, step recognition and intra-video retrieval, in the unique domain of International Space Station spacewalk recordings. State-of-the-art models perform poorly on both, which highlights how hard it is to generalize to new domains and to reason over multiple modalities. Humans outperform the models, underscoring the importance of multimodal content and long-term context. The dataset consists of densely annotated videos with structured temporal representations, posing a novel challenge for video understanding systems.
Statistics
Spacewalk-18's step recognition and intra-video retrieval tasks prove highly difficult for current models.
State-of-the-art methods perform poorly on the benchmark.
Incorporating visual and text modalities improves task performance.
Quotes
"We find that state-of-the-art methods perform poorly on our benchmark."
"Our experiments underscore the need to develop new approaches to these tasks."
"Both multimodal information and long-term video context are essential to solve the tasks."