Comprehensive Evaluation of Video Large Multi-modal Models' Reasoning and Robustness Capabilities in Complex Real-World Scenarios
Video-LMMs struggle to comprehend complex videos correctly, which points to weak reasoning and a lack of robustness to textual user queries. A training-free Dual-Step Contextual Prompting (DSCP) technique can effectively improve the performance of existing Video-LMMs.
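
The summary above does not spell out the two steps of DSCP, so the following is only a minimal sketch of what a training-free, two-step prompting wrapper might look like. It assumes that step one elicits a question-independent description of the video and step two answers the user query conditioned on that description. The `VideoLMM` callable interface, the prompt wording, and the function name `dual_step_answer` are all hypothetical and are not taken from the source.

```python
from typing import Callable, Sequence

# Hypothetical interface: any callable that takes video frames and a text
# prompt and returns the model's text response. Not a real library API.
VideoLMM = Callable[[Sequence[object], str], str]

# Assumed wording for the first (context-gathering) prompt.
CONTEXT_PROMPT = (
    "Describe what actually happens in the video, step by step. "
    "Mention only actions and objects that are visible."
)


def dual_step_answer(model: VideoLMM, frames: Sequence[object], question: str) -> str:
    """Query the model twice without any training or weight updates."""
    # Step 1 (assumed): obtain a description that does not depend on the
    # user's question, so a misleading question cannot bias it.
    context = model(frames, CONTEXT_PROMPT)

    # Step 2 (assumed): answer the question conditioned on that description.
    final_prompt = (
        f"Context from the video: {context}\n"
        f"Question: {question}\n"
        "Answer using only the video and the context above. "
        "If the question assumes something that is not shown, say so."
    )
    return model(frames, final_prompt)
```

Because the wrapper only changes the prompts, it can in principle be placed around any existing model that exposes a frames-plus-text interface, which is what makes the approach training-free.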