This paper proposes AntGPT, a framework that leverages large language models (LLMs) to address the long-term action anticipation (LTA) task from video observations. The key insights are:
Top-down LTA (goal-conditioned prediction) can outperform bottom-up approaches by using LLMs to infer high-level goals from the observed actions. Goal inference is achieved via in-context learning, which requires only a few human-provided examples.
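To make the in-context goal-inference idea concrete, here is a minimal sketch in Python, assuming the observed video has already been converted into a sequence of action labels by a recognition model. The demonstrations, prompt wording, and the `llm` callable are illustrative placeholders, not the paper's exact prompt or model:

```python
# Sketch: infer a high-level goal from observed actions via few-shot prompting.
# The example (actions -> goal) pairs below are hypothetical demonstrations.

from typing import Callable, Sequence

# A few human-provided demonstrations for in-context learning.
DEMONSTRATIONS = [
    (["crack egg", "whisk egg", "heat pan", "pour egg"], "make an omelet"),
    (["pick up wrench", "loosen bolt", "remove wheel"], "change a tire"),
]

def build_goal_prompt(observed_actions: Sequence[str]) -> str:
    """Format the few-shot examples plus the query into a single text prompt."""
    lines = ["Infer the person's high-level goal from the observed actions."]
    for actions, goal in DEMONSTRATIONS:
        lines.append(f"Actions: {', '.join(actions)}\nGoal: {goal}")
    lines.append(f"Actions: {', '.join(observed_actions)}\nGoal:")
    return "\n\n".join(lines)

def infer_goal(observed_actions: Sequence[str], llm: Callable[[str], str]) -> str:
    """Query an LLM (any text-in/text-out callable) for the inferred goal."""
    return llm(build_goal_prompt(observed_actions)).strip()

if __name__ == "__main__":
    # Stub LLM so the sketch runs end to end; swap in a real model call.
    fake_llm = lambda prompt: "make a salad"
    print(infer_goal(["wash lettuce", "chop tomato", "mix in bowl"], fake_llm))
```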
The same action-label representation of the video allows LLMs to effectively model the temporal dynamics of human behaviors, achieving competitive performance without relying on explicitly inferred goals. This suggests that LLMs can implicitly capture goal information when predicting future actions.
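The bottom-up variant can be sketched the same way: the LLM simply continues the action sequence, with no goal in the prompt. The prompt format and output parsing here are assumptions for illustration, not the paper's exact setup:

```python
# Sketch: bottom-up anticipation, where the LLM extends the observed action
# sequence directly. The `llm` callable is any text-in/text-out model.

from typing import Callable, List, Sequence

def predict_future_actions(
    observed_actions: Sequence[str],
    num_future: int,
    llm: Callable[[str], str],
) -> List[str]:
    """Ask the LLM to continue the observed action sequence."""
    prompt = (
        f"Observed actions: {', '.join(observed_actions)}\n"
        f"Predict the next {num_future} actions, comma-separated:"
    )
    completion = llm(prompt)
    # Split the completion back into discrete action labels.
    predicted = [a.strip() for a in completion.split(",") if a.strip()]
    return predicted[:num_future]

if __name__ == "__main__":
    fake_llm = lambda prompt: "pour egg, season egg, fold omelet"
    print(predict_future_actions(["crack egg", "whisk egg", "heat pan"], 3, fake_llm))
```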
The useful prior knowledge encoded by LLMs can be distilled into a very compact neural network (1.3% of the original LLM's size), enabling efficient inference while matching or even exceeding the original LTA performance.
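One common way to realize such distillation is to train a small student model on future-action pseudo-labels produced by the frozen teacher LLM. The sketch below assumes this setup; the architecture, vocabulary size, and dimensions are illustrative, not the paper's actual student model:

```python
# Sketch: distill the teacher LLM's anticipation behavior into a tiny student
# by supervising the student with teacher-generated future-action labels.

import torch
import torch.nn as nn

VOCAB, DIM, HORIZON = 1000, 128, 4  # action vocabulary, width, future steps

class StudentAnticipator(nn.Module):
    """Tiny sequence model: embeds past actions, predicts HORIZON future ones."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.encoder = nn.GRU(DIM, DIM, batch_first=True)
        self.heads = nn.Linear(DIM, HORIZON * VOCAB)  # one logit set per step

    def forward(self, past_actions: torch.Tensor) -> torch.Tensor:
        _, h = self.encoder(self.embed(past_actions))      # h: (1, B, DIM)
        return self.heads(h[-1]).view(-1, HORIZON, VOCAB)  # (B, HORIZON, VOCAB)

def distillation_step(student, optimizer, past, teacher_future):
    """One training step on teacher-generated future-action pseudo-labels."""
    logits = student(past)                                  # (B, HORIZON, VOCAB)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB), teacher_future.reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    student = StudentAnticipator()
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    past = torch.randint(0, VOCAB, (8, 6))               # batch of action histories
    teacher_out = torch.randint(0, VOCAB, (8, HORIZON))  # teacher pseudo-labels
    print(distillation_step(student, opt, past, teacher_out))
```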
The paper conducts extensive experiments on the Ego4D, EPIC-Kitchens-55, and EGTEA GAZE+ benchmarks, demonstrating the effectiveness of leveraging LLMs for both goal inference and temporal dynamics modeling in the LTA task.
Key insights distilled from: AntGPT, by Qi Zhao, Shij... at arxiv.org, 04-02-2024
https://arxiv.org/pdf/2307.16368.pdf