Toward Understanding Dynamic Scenes with Large Language Models: DoraemonGPT, a Comprehensive System for Video-based Tasks
DoraemonGPT is a comprehensive and conceptually elegant system driven by large language models (LLMs) to handle dynamic video tasks, including spatial-temporal reasoning, large planning space exploration, and external knowledge integration.