Core Concepts
Large language models can effectively infer high-level goals and model the temporal dynamics of human actions, enabling state-of-the-art performance on long-term action anticipation tasks.
Summary
This paper proposes AntGPT, a framework that leverages large language models (LLMs) to address the long-term action anticipation (LTA) task from video observations. The key insights are:
- Top-down (goal-conditioned) LTA can outperform bottom-up approaches by using LLMs to infer high-level goals from the observed actions. Goal inference is achieved via in-context learning, which requires only a few human-provided examples (see the prompting sketch after this list).
- The same action-based video representation allows LLMs to model the temporal dynamics of human behavior directly, achieving competitive performance without relying on explicitly inferred goals. This suggests that LLMs can implicitly capture goal information when predicting future actions (the bottom-up path in the prompting sketch below).
- The useful prior knowledge encoded by LLMs can be distilled into a very compact neural network (about 1.3% of the original LLM's size), enabling efficient inference while maintaining similar or even better LTA performance (see the distillation sketch after this list).
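The goal-inference and anticipation steps can be pictured as two LLM prompts. Below is a minimal sketch, not the paper's actual implementation: the prompt wording, the example (actions, goal) demonstrations, and the `llm_complete` stub are all illustrative assumptions. Passing `goal=None` corresponds to the bottom-up variant.

```python
# Sketch of goal inference via in-context learning, followed by
# goal-conditioned (top-down) action anticipation. Prompt wording,
# example goals, and `llm_complete` are illustrative assumptions.

def llm_complete(prompt: str) -> str:
    """Placeholder for an LLM call; plug in your own client here."""
    raise NotImplementedError

# Few human-provided (actions -> goal) demonstrations for in-context learning.
ICL_EXAMPLES = [
    ("crack egg, whisk egg, heat pan, pour egg", "make omelette"),
    ("cut dough, roll dough, spread sauce, add cheese", "make pizza"),
]

def infer_goal(observed_actions: list[str]) -> str:
    """Infer the actor's high-level goal from recognized verb-noun actions."""
    demos = "\n".join(f"Actions: {a}\nGoal: {g}" for a, g in ICL_EXAMPLES)
    prompt = f"{demos}\nActions: {', '.join(observed_actions)}\nGoal:"
    return llm_complete(prompt).strip()

def anticipate_actions(observed_actions: list[str],
                       goal: str | None, k: int = 5) -> list[str]:
    """Predict the next k verb-noun actions. With `goal` this is the
    top-down (goal-conditioned) variant; with goal=None it is bottom-up."""
    goal_line = f"The actor's goal is: {goal}.\n" if goal else ""
    prompt = (
        f"{goal_line}"
        f"Observed actions: {', '.join(observed_actions)}.\n"
        f"List the next {k} actions as comma-separated verb-noun pairs:"
    )
    return [a.strip() for a in llm_complete(prompt).split(",")][:k]

# Example usage: infer the goal first, then condition anticipation on it.
# observed = ["crack egg", "add rice"]
# goal = infer_goal(observed)              # e.g. "make fried rice"
# future = anticipate_actions(observed, goal, k=5)
```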
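For the distillation result, a standard knowledge-distillation recipe conveys the idea. The sketch below is an assumption-laden illustration, not AntGPT's exact procedure: the student architecture, vocabulary size, and temperature are invented for the example; only the frozen-teacher-to-compact-student setup mirrors the paper's claim.

```python
# Sketch of distilling the LLM's anticipation behavior into a compact
# student network. Dimensions, the teacher interface, and the temperature
# are assumptions; the paper reports a student at roughly 1.3% of the
# original LLM's parameter count.

import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 512     # size of the verb-noun action vocabulary (assumed)
DIM = 256       # compact student width (assumed)

class StudentAnticipator(nn.Module):
    """Small transformer mapping observed action tokens to future-action logits."""
    def __init__(self) -> None:
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, actions: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(self.embed(actions)))

def distill_step(student, optimizer, actions, teacher_logits, temperature=2.0):
    """One distillation step: match the student's logits to the frozen
    teacher's soft predictions over future actions via KL divergence."""
    student_logits = student(actions)
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with dummy tensors standing in for real teacher outputs:
# student = StudentAnticipator()
# opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
# actions = torch.randint(0, VOCAB, (8, 16))    # batch of action sequences
# teacher_logits = torch.randn(8, 16, VOCAB)    # from the frozen LLM teacher
# distill_step(student, opt, actions, teacher_logits)
```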
The paper conducts extensive experiments on the Ego4D, EPIC-Kitchens-55, and EGTEA GAZE+ benchmarks, demonstrating the effectiveness of leveraging LLMs for both goal inference and temporal dynamics modeling in the LTA task.
Statistics
"can better anticipate an actor's future actions (e.g. mix eggs) by knowing what commonly happens after the current action (e.g. crack eggs)"
"the long-term action anticipation (LTA) task aims to predict an actor's future behavior from video observations in the form of verb and noun sequences"
"the LTA task is challenging due to noisy perception (e.g. action recognition), and the inherent ambiguity and uncertainty that reside in human behaviors"
Quotes
"Can we better anticipate an actor's future actions (e.g. mix eggs) by knowing what commonly happens after the current action (e.g. crack eggs)? What if the actor also shares the goal (e.g. make fried rice) with us?"
"We hypothesize that large language models (LLMs), which have been pretrained on procedure text data (e.g. recipes, how-tos), have the potential to help LTA from both perspectives."
"Ideally, the prior knowledge can help both bottom-up and top-down LTA approaches, as they can not only answer questions such as 'what are the most likely actions following this current action?', but also 'what is the actor trying to achieve, and what are the remaining steps to achieve the goal?'"