Modular Reasoning for Effective Video Question Answering: Decomposing Planning and Execution for Interpretable and Accurate Multimodal Understanding
This paper proposes MoReVQA, a multi-stage modular reasoning framework that decomposes video question answering into three stages: event parsing, grounding, and reasoning. The approach outperforms both prior single-stage modular methods and a strong baseline while exposing interpretable intermediate outputs at each stage.
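To make the three-stage decomposition concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes the video is already represented as one text caption per frame, and it replaces each stage with simple keyword logic. The names `Memory`, `parse_events`, `ground`, `reason`, and `answer_question` are hypothetical and chosen only for illustration. The point is the control flow: each stage reads and extends a shared state, so its intermediate output can be inspected.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Hypothetical shared state that every stage reads from and writes to."""
    question: str
    events: list = field(default_factory=list)   # output of event parsing
    frames: list = field(default_factory=list)   # output of grounding
    answer: str = ""                             # output of reasoning


def parse_events(mem: Memory) -> Memory:
    # Toy stand-in for event parsing: keep the content words of the question.
    stop_words = {"what", "does", "do", "the", "a", "is", "in", "after"}
    words = mem.question.lower().rstrip("?").split()
    mem.events = [w for w in words if w not in stop_words]
    return mem


def ground(mem: Memory, captions: list) -> Memory:
    # Toy stand-in for grounding: select frames whose caption mentions an event word.
    mem.frames = [
        i for i, caption in enumerate(captions)
        if any(event in caption.lower().split() for event in mem.events)
    ]
    return mem


def reason(mem: Memory, captions: list) -> Memory:
    # Toy stand-in for reasoning: answer with the last grounded caption, if any.
    mem.answer = captions[mem.frames[-1]] if mem.frames else "unknown"
    return mem


def answer_question(question: str, captions: list) -> Memory:
    # Run the stages in order; the returned Memory keeps every intermediate output.
    mem = Memory(question)
    mem = parse_events(mem)
    mem = ground(mem, captions)
    mem = reason(mem, captions)
    return mem


if __name__ == "__main__":
    captions = ["a dog sleeps", "a man opens the door", "the man sits down"]
    result = answer_question("What does the man do?", captions)
    print(result.events, result.frames, result.answer)
```

Because the stages communicate only through the shared state, any one of them could be swapped for a stronger component without touching the others, and a wrong answer can be traced to the stage whose intermediate output first went wrong.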