State-of-the-art models struggle with Spacewalk-18 tasks, highlighting the need for improved video-language models.
Spacewalk-18 is a challenging benchmark for procedural video understanding that exposes the difficulty of task recognition and segmentation in a multimodal, long-form context.