Core Concepts
TraveLER is a modular multi-agent framework that iteratively traverses a video, locates and extracts relevant information from keyframes through interactive question-answering, evaluates whether there is enough information to answer the question, and replans if necessary.
Abstract
The content introduces TraveLER, a novel framework for video question answering (VideoQA) that uses a multi-agent approach. The key components of the framework are:
Traversal: An agent creates a plan to traverse through the video and collect relevant information to answer the question.
Locator: This component has two sub-modules. The Retriever selects the next timestamps to view based on the plan, and the Extractor generates context-dependent questions about the selected frames and extracts answers using a vision-language model.
Evaluator: This agent reviews the collected information, determines if there is enough to answer the question, and decides whether to output the answer or replan and start a new iteration.
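The interaction between these components can be sketched as a simple control loop. The code below is a minimal illustration, not the authors' implementation: every name in it (Agents, answer_question, the toy_* functions, max_iterations) is hypothetical, and the language and vision-language models that play each role in TraveLER are replaced by plain Python callables operating on a toy "video" of captions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical types and names, chosen only for this sketch.
Memory = Dict[int, List[str]]  # timestamp -> facts extracted from that frame


@dataclass
class Agents:
    plan: Callable[[str, Memory], str]                # Traversal: question, memory -> plan
    retrieve: Callable[[str, Memory], List[int]]      # Retriever: plan, memory -> timestamps
    extract: Callable[[int, str, Memory], List[str]]  # Extractor: timestamp, question, memory -> facts
    evaluate: Callable[[str, Memory], Optional[str]]  # Evaluator: answer, or None to replan


def answer_question(question: str, agents: Agents, max_iterations: int = 5) -> Optional[str]:
    """Plan, locate, evaluate, and replan until an answer is found or the budget runs out."""
    memory: Memory = {}
    for _ in range(max_iterations):
        plan = agents.plan(question, memory)          # (re)plan using what is known so far
        for t in agents.retrieve(plan, memory):       # choose the next timestamps to view
            facts = agents.extract(t, question, memory)
            memory.setdefault(t, []).extend(facts)    # accumulate information across iterations
        answer = agents.evaluate(question, memory)    # is there enough information?
        if answer is not None:
            return answer
    return None  # iteration budget exhausted without a confident answer


# Toy stand-ins: the "video" is a mapping from timestamp to a caption.
video = {0: "a man holds a cup", 5: "the man drinks", 10: "the man puts the cup down"}


def toy_plan(question: str, memory: Memory) -> str:
    return "view unseen frames in temporal order"


def toy_retrieve(plan: str, memory: Memory) -> List[int]:
    unseen = [t for t in sorted(video) if t not in memory]
    return unseen[:1]  # view one new frame per iteration


def toy_extract(t: int, question: str, memory: Memory) -> List[str]:
    return [video[t]]  # a real Extractor would ask a vision-language model about the frame


def toy_evaluate(question: str, memory: Memory) -> Optional[str]:
    for facts in memory.values():
        if any("puts the cup down" in fact for fact in facts):
            return "He puts the cup down."
    return None  # not enough information yet, so the loop replans


toy = Agents(toy_plan, toy_retrieve, toy_extract, toy_evaluate)
result = answer_question("What does the man do after drinking?", toy)
print(result)
```

In this toy run the Evaluator only finds enough information after the third frame has been viewed, so the loop replans twice before answering. The shared memory is the piece that lets later plans and questions depend on what earlier iterations found.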
The framework is designed to address limitations of existing approaches that use image-based models for VideoQA. These methods often overlook how keyframes are selected and cannot adjust when incorrect timestamps are identified. Moreover, they provide general descriptions of frames rather than extracting details relevant to the question.
In contrast, TraveLER iteratively collects information, allowing it to adaptively focus on relevant parts of the video and extract fine-grained details to answer the question. Extensive experiments show that TraveLER outperforms state-of-the-art zero-shot VideoQA methods on benchmarks like NExT-QA, STAR, and Perception Test.
Stats
The content does not contain any explicit numerical data or statistics. It focuses on describing the high-level architecture and components of the TraveLER framework.
Quotes
The content does not contain any striking quotes that support its key arguments.