This paper proposes a multi-stage, modular reasoning framework called MoReVQA that decomposes video question answering into event parsing, grounding, and reasoning stages. This approach outperforms prior single-stage modular methods and a strong baseline, while providing interpretable intermediate outputs.
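The staged decomposition can be pictured as a pipeline that threads a shared memory of interpretable intermediate results through the three stages. The sketch below is a minimal illustration under that reading only; the Memory fields and the stage functions parse_events, ground_events, and reason are hypothetical placeholders, not the paper's actual modules.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Shared state passed between stages; fields are illustrative."""
    events: list = field(default_factory=list)     # parsed event phrases
    grounded: dict = field(default_factory=dict)   # event -> (start, end) frame span
    trace: list = field(default_factory=list)      # interpretable intermediate log

def parse_events(question, mem):
    # Hypothetical stage 1: decompose the question into event phrases.
    mem.events = [question]                        # placeholder decomposition
    mem.trace.append(f"events: {mem.events}")
    return mem

def ground_events(frames, mem):
    # Hypothetical stage 2: attach a frame span to each parsed event.
    mem.grounded = {e: (0, len(frames)) for e in mem.events}
    mem.trace.append(f"grounded: {mem.grounded}")
    return mem

def reason(question, mem):
    # Hypothetical stage 3: answer using only the grounded evidence.
    mem.trace.append("reasoning over grounded events")
    return "placeholder answer"

def answer(frames, question):
    mem = Memory()
    result = reason(question, ground_events(frames, parse_events(question, mem)))
    return result, mem    # mem.trace exposes the interpretable intermediate outputs
```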
This paper proposes a neural-symbolic framework, Neural-Symbolic VideoQA (NS-VideoQA), for effective compositional spatio-temporal reasoning in real-world video question answering. The framework transforms unstructured video into a symbolic representation capturing persons, objects, relations, and action chronologies, and then performs iterative symbolic reasoning over this representation to answer compositional questions.
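As a rough illustration of the two halves of such a framework, the sketch below pairs a toy symbolic video representation with an iterative program executor. The SymbolicEvent schema and the operator names (filter_relation, filter_object, query_subject) are assumptions made for this example, not the paper's actual representation or operator set.

```python
from dataclasses import dataclass

@dataclass
class SymbolicEvent:
    # One entry in the symbolic video representation (illustrative schema).
    subject: str      # e.g. "person_1"
    relation: str     # e.g. "holds"
    obj: str          # e.g. "cup"
    t_start: int      # frame index where the relation begins
    t_end: int        # frame index where the relation ends

def execute_program(events, program):
    """Iteratively apply symbolic reasoning steps (hypothetical operator set)."""
    pool = list(events)
    for op, arg in program:
        if op == "filter_relation":
            pool = [e for e in pool if e.relation == arg]
        elif op == "filter_object":
            pool = [e for e in pool if e.obj == arg]
        elif op == "query_subject":
            return pool[0].subject if pool else "unknown"
    return pool

# Example: "Who holds the cup?" expressed as a small functional program.
events = [SymbolicEvent("person_1", "holds", "cup", 10, 40)]
program = [("filter_relation", "holds"), ("filter_object", "cup"), ("query_subject", None)]
print(execute_program(events, program))   # -> "person_1"
```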
Current vision-language models excel at video question answering but struggle to ground their predictions in relevant video content, often relying on language shortcuts and irrelevant visual context.
Two efficient frame-sampling methods, Most Implied Frames (MIF) and Most Dominant Frames (MDF), are proposed to select informative frames and thereby boost the performance of image-text models on video question answering tasks.
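The general idea of such sampling is to score frames and keep the top-k before feeding them to an image-text model. The sketch below is only a proxy for that idea under assumed scoring rules (similarity to the clip mean for "dominant" frames, similarity to the question embedding for "implied" frames); it is not the paper's actual MIF/MDF procedure.

```python
import numpy as np

def most_dominant_frames(frame_feats, k):
    """Hypothetical proxy for MDF: score each frame by cosine similarity
    to the mean clip feature and keep the k most representative frames."""
    clip_mean = frame_feats.mean(axis=0)
    scores = frame_feats @ clip_mean / (
        np.linalg.norm(frame_feats, axis=1) * np.linalg.norm(clip_mean) + 1e-8)
    return np.sort(np.argsort(-scores)[:k])   # selected indices in temporal order

def most_implied_frames(frame_feats, question_feat, k):
    """Hypothetical proxy for MIF: score each frame by cosine similarity
    to the question embedding and keep the k most question-relevant frames."""
    scores = frame_feats @ question_feat / (
        np.linalg.norm(frame_feats, axis=1) * np.linalg.norm(question_feat) + 1e-8)
    return np.sort(np.argsort(-scores)[:k])

# Toy usage with random features standing in for CLIP-style embeddings.
feats = np.random.randn(32, 512)   # 32 frames, 512-d features
q = np.random.randn(512)           # question embedding
print(most_dominant_frames(feats, 4), most_implied_frames(feats, q, 4))
```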
VideoDistill employs a language-aware gating mechanism so that answers are generated solely from question-related visual embeddings, enabling goal-driven visual perception and answer generation; this distinguishes it from previous video-language models that fuse language directly into visual representations.
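A language-aware gate can be pictured as a question-conditioned mask over visual features. The minimal sketch below assumes fixed feature dimensions and a single sigmoid gate; it illustrates the gating idea only and is not VideoDistill's actual architecture.

```python
import torch
import torch.nn as nn

class LanguageAwareGate(nn.Module):
    """Minimal sketch of a language-aware gating layer (dimensions are assumptions).
    The question embedding produces a per-channel gate in [0, 1] that suppresses
    visual features irrelevant to the question, rather than injecting language
    content into the visual stream itself."""
    def __init__(self, vis_dim=512, txt_dim=512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(txt_dim, vis_dim), nn.Sigmoid())

    def forward(self, vis_feats, q_feat):
        # vis_feats: (num_frames, vis_dim), q_feat: (txt_dim,)
        g = self.gate(q_feat)      # (vis_dim,) channel-wise gate values
        return vis_feats * g       # keep only question-relevant visual channels

# Toy usage: gate 16 frame embeddings with a question embedding.
gate = LanguageAwareGate()
out = gate(torch.randn(16, 512), torch.randn(512))
print(out.shape)   # torch.Size([16, 512])
```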