Key Concepts
VideoAgent uses a large language model as an agent that iteratively identifies and compiles the crucial information in a long-form video, emphasizing interactive reasoning over direct visual processing.
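The iterative loop described above can be illustrated with a minimal Python sketch. This is not VideoAgent's actual implementation: the names `run_agent`, `describe`, `decide`, `initial_frames`, and `max_rounds` are all hypothetical, and the language model and the visual tools are replaced by plain callables so the control flow can run on its own. The sketch assumes the agent starts from a few uniformly spaced frames and, each round, either answers or asks for more frames.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class AgentState:
    """Evidence gathered so far: frame index -> text description."""
    notes: Dict[int, str] = field(default_factory=dict)


def run_agent(
    num_frames: int,
    describe: Callable[[int], str],
    decide: Callable[[Dict[int, str]], Tuple[str, bool, List[int]]],
    initial_frames: int = 5,
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Gather frame descriptions round by round until `decide` is confident.

    `describe` stands in for a visual tool that turns a frame into text.
    `decide` stands in for the language model: given the notes so far, it
    returns (answer, is_confident, extra_frame_indices_to_inspect).
    Returns (answer, number_of_frames_inspected).
    """
    state = AgentState()
    # Assumption: begin with a handful of uniformly spaced frames.
    step = max(1, num_frames // initial_frames)
    pending = list(range(0, num_frames, step))[:initial_frames]
    answer = ""
    for _ in range(max_rounds):
        # Describe only valid frames that have not been seen yet.
        for idx in pending:
            if 0 <= idx < num_frames and idx not in state.notes:
                state.notes[idx] = describe(idx)
        answer, confident, pending = decide(state.notes)
        if confident or not pending:
            break
    return answer, len(state.notes)


# Toy stand-ins (hypothetical): a 100-frame "video" with one informative frame.
toy_video = ["background"] * 100
toy_video[42] = "key event"


def toy_describe(idx: int) -> str:
    return toy_video[idx]


def toy_decide(notes: Dict[int, str]) -> Tuple[str, bool, List[int]]:
    hits = [i for i, text in notes.items() if "key" in text]
    if hits:
        return f"event at frame {hits[0]}", True, []
    # Not enough evidence: ask for frames slightly after those already seen.
    return "unknown", False, [i + 2 for i in notes]


if __name__ == "__main__":
    print(run_agent(100, toy_describe, toy_decide))
```

In this toy run the loop stops as soon as the decision function reports confidence, so only a small fraction of the frames is ever described. That is the intuition behind favoring interactive reasoning over processing every frame.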
Statistics
Evaluated on the EgoSchema and NExT-QA benchmarks, VideoAgent achieves 54.1% and 71.3% zero-shot accuracy, respectively, using only 8.4 and 8.2 frames on average.