Key Concepts
This paper proposes a multi-stage, modular reasoning framework called MoReVQA that decomposes video question answering into event parsing, grounding, and reasoning stages. This approach outperforms prior single-stage modular methods and a strong baseline, while providing interpretable intermediate outputs.
Abstract
The paper addresses video question answering (videoQA) by proposing a multi-stage, modular reasoning framework called MoReVQA. The framework consists of three stages:
Event Parsing Stage:
- Focuses on parsing the input question to identify relevant events, temporal relationships, and question types.
- Generates a set of API calls based on the parsed information to populate the external memory.
Grounding Stage:
- Grounds the identified events from the previous stage in the video content.
- Leverages vision-language models to locate relevant temporal regions and entities in the video.
- Verifies and resolves any ambiguities in the grounding.
Reasoning Stage:
- Performs grounded reasoning on the video and question context stored in the external memory.
- Decomposes the original question into sub-questions to unravel different aspects.
- Combines the grounded video context and the outputs of the previous stages to generate the final answer.
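The three stages above can be sketched as functions that read from and write to one shared memory object. This is a minimal illustration, not the authors' implementation: the `Memory` layout, the rule-based parsing, the caption dictionary standing in for a vision-language model, and the `answer_fn` callback standing in for an LLM are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Memory:
    """Hypothetical external memory shared by all stages."""
    question: str
    events: List[str] = field(default_factory=list)
    frame_ids: List[int] = field(default_factory=list)
    facts: List[str] = field(default_factory=list)


def parse_events(memory: Memory) -> None:
    """Stage 1: inspect only the question (toy rules instead of an LLM)."""
    q = memory.question.lower().rstrip("?")
    if "near the end" in q:
        # Record a temporal hint and strip it from the event text.
        memory.events.append("temporal:end")
        q = q.replace("near the end", "").strip()
    memory.events.append("event:" + q)


def ground(memory: Memory, captions: Dict[int, str]) -> None:
    """Stage 2: select frames that match the parsed temporal hint."""
    frames = sorted(captions)
    if "temporal:end" in memory.events:
        frames = frames[len(frames) // 2:]  # keep the later half of the video
    memory.frame_ids = frames


def reason(memory: Memory, captions: Dict[int, str],
           answer_fn: Callable[[str, List[str]], str]) -> str:
    """Stage 3: answer from the grounded context stored in memory."""
    memory.facts = [f"[frame {i}] {captions[i]}" for i in memory.frame_ids]
    return answer_fn(memory.question, memory.facts)


def run_pipeline(question: str, captions: Dict[int, str],
                 answer_fn: Callable[[str, List[str]], str]):
    """Run the three stages in order and return the answer plus the memory."""
    memory = Memory(question)
    parse_events(memory)
    ground(memory, captions)
    answer = reason(memory, captions, answer_fn)
    return answer, memory
```

Returning the memory alongside the answer keeps every intermediate output inspectable, which mirrors the interpretability claim made for the staged design.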
The modular, multi-stage design of MoReVQA outperforms prior single-stage modular methods as well as a strong baseline called JCEF (Just Caption Every Frame), which captions every frame and uses an LLM to predict the answer. MoReVQA achieves state-of-the-art results on four standard videoQA benchmarks, while also providing interpretable intermediate outputs at each stage.
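The JCEF baseline described above reduces to two steps: caption each frame, then pass all captions and the question to an LLM. The sketch below illustrates that flow under assumptions: `caption_fn` and `llm_fn` are hypothetical callables standing in for a captioning model and a language model, and the prompt layout is invented for this example.

```python
from typing import Callable, Sequence


def jcef_answer(frames: Sequence[object], question: str,
                caption_fn: Callable[[object], str],
                llm_fn: Callable[[str], str]) -> str:
    """Caption every frame, then ask the LLM once with all captions."""
    captions = [caption_fn(frame) for frame in frames]
    # One line per frame, followed by the question (hypothetical prompt format).
    lines = [f"[frame {i}] {caption}" for i, caption in enumerate(captions)]
    prompt = "\n".join(lines) + f"\nQuestion: {question}\nAnswer:"
    return llm_fn(prompt)
```

Unlike the staged design, this baseline has no planning or grounding step; every frame is captioned regardless of the question.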
Statistics
No numeric metrics are listed here; the excerpt below shows an example question followed by frame-level intermediate outputs:
"why was the cat lying on its back near the end?"
[frame 42] what is the cat doing?: playing
[frame 48] what surrounds the cat?: a person
[frame 48] why was the cat lying on its back?: to be petted
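The frame-level lines above follow a `[frame N] question?: answer` pattern. As a small illustration, they can be parsed into structured records for inspection. The pattern is inferred only from these three lines, and the `FrameQA` record and `parse_trace_line` helper are hypothetical names introduced for this sketch.

```python
import re
from typing import NamedTuple, Optional


class FrameQA(NamedTuple):
    """One frame-level question/answer pair (hypothetical record type)."""
    frame: int
    question: str
    answer: str


# Assumed layout: "[frame <int>] <question ending in ?>: <answer>"
_TRACE_RE = re.compile(r"^\[frame (\d+)\]\s*(.+?\?):\s*(.+)$")


def parse_trace_line(line: str) -> Optional[FrameQA]:
    """Return a FrameQA for a matching line, or None if it does not match."""
    match = _TRACE_RE.match(line.strip())
    if match is None:
        return None
    frame, question, answer = match.groups()
    return FrameQA(int(frame), question, answer)
```

Lines that do not match, such as the quoted top-level question, return `None` so that callers can skip them.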
Quotes
"The core issue of the overall system lies in the difficult task given to its single-stage planner: before performing visual reasoning, the model must output a full program without any grounding in the actual video itself."
"Through this decomposition, our MoReVQA model $M_{\text{multi-stage}} = \{M_1, M_2, M_3\}$ relies on smaller focused prompts $\{P_1, P_2, P_3\}$ for each stage; furthermore, intermediate reasoning outputs $\{z_1, z_2, z_3\}$ are able to handle different aspects of the overall task, and incorporate grounding in the video itself to resolve ambiguities and inform new intermediate reasoning steps in a more effective manner than the ungrounded single-stage setting."