TraveLER is a modular multi-agent framework that iteratively traverses a video, locating and extracting relevant information from keyframes through interactive question-answering, evaluating whether enough information has been gathered to answer the question, and replanning if necessary.
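The traverse → extract → evaluate → replan cycle described above can be sketched as a simple control loop. This is an illustrative sketch only, not the TraveLER implementation: the helpers `extract_info`, `evaluate`, and `replan` are hypothetical stand-ins for what would be LLM/VLM-backed agents in the real system.

```python
# Hypothetical sketch of an iterative keyframe traversal loop in the
# spirit of TraveLER; all helper functions are illustrative stand-ins.

def answer_video_question(question, num_frames, max_rounds=5):
    """Repeatedly select keyframes, gather evidence from them, and
    replan until enough information is collected to answer."""
    evidence = []                                 # accumulated keyframe clues
    plan = [0, num_frames // 2, num_frames - 1]   # initial coarse keyframe plan

    for _ in range(max_rounds):
        # Locate & extract: question-answer each planned keyframe.
        for idx in plan:
            evidence.append(extract_info(idx, question))

        # Evaluate: is the collected evidence sufficient?
        done, answer = evaluate(question, evidence)
        if done:
            return answer

        # Replan: pick new keyframes based on gaps in the evidence.
        plan = replan(question, evidence, num_frames)

    return evaluate(question, evidence)[1]        # best-effort answer


# Toy stand-ins so the loop runs; a real system would call an LLM/VLM.
def extract_info(idx, question):
    return f"frame {idx}: observation relevant to '{question}'"

def evaluate(question, evidence):
    sufficient = len(evidence) >= 6               # toy sufficiency criterion
    return sufficient, f"answer supported by {len(evidence)} clues"

def replan(question, evidence, num_frames):
    return [num_frames // 4, 3 * num_frames // 4]
```

The key design point this loop captures is that keyframe selection is not fixed up front: the evaluator's verdict feeds back into the planner, so later rounds can target parts of the video the first pass missed.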
BoViLA, a novel self-training framework, uses large language models (LLMs) to improve video-language alignment and boost performance on video question answering tasks.
Integrating domain-specific entity-action heuristics into video-language foundation models improves their performance on Video Question Answering (VideoQA) by enabling more precise, context-aware reasoning.
This paper introduces MCG, a novel end-to-end VideoQA model that combines multi-granularity contrastive learning with cross-modal collaborative generation to achieve state-of-the-art performance on open-ended questions about long-term videos.