The authors propose MovieChat, which integrates vision models with large language models to address the challenges of analyzing long videos. It relies on a memory mechanism in which Transformer tokens serve as the carriers of memory.
The proposed memory mechanism is intended to improve long video understanding, and the paper reports state-of-the-art performance.
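
To make the idea of token-based memory concrete, here is a minimal sketch of a fixed-capacity token buffer. It is an illustration under stated assumptions, not the authors' implementation: the `TokenMemory` class, the cosine-similarity criterion, and the averaging merge rule are all hypothetical choices made for this example. The summary above says only that tokens represent the memory.

```python
import math
from typing import List

Token = List[float]  # one frame-level feature vector (hypothetical representation)


def cosine(a: Token, b: Token) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class TokenMemory:
    """Hypothetical bounded memory: holds at most `capacity` tokens.

    When the buffer overflows, the most similar pair of adjacent tokens
    is averaged into one, so temporal order is kept while redundant
    neighbours are compressed. This merge rule is an assumption.
    """

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.tokens: List[Token] = []

    def add(self, token: Token) -> None:
        """Append a token, then compress until the buffer fits."""
        self.tokens.append(list(token))
        while len(self.tokens) > self.capacity:
            self._merge_most_similar_pair()

    def _merge_most_similar_pair(self) -> None:
        # Similarity of every adjacent pair (i, i + 1).
        sims = [
            cosine(self.tokens[i], self.tokens[i + 1])
            for i in range(len(self.tokens) - 1)
        ]
        i = max(range(len(sims)), key=sims.__getitem__)
        merged = [(x + y) / 2 for x, y in zip(self.tokens[i], self.tokens[i + 1])]
        self.tokens[i:i + 2] = [merged]


if __name__ == "__main__":
    memory = TokenMemory(capacity=2)
    for frame_token in ([1.0, 0.0], [1.0, 0.1], [0.0, 1.0]):
        memory.add(frame_token)
    print(len(memory.tokens))
```

In this sketch the memory footprint stays constant no matter how many frames are added, which is the property a long-video system needs. What is lost is the detail of whichever neighbouring tokens were judged most similar.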