By leveraging human intention as high-level guidance, the proposed framework can effectively anticipate long-term sequences of future human actions in egocentric videos.
Large language models can effectively infer high-level goals and model the temporal dynamics of human actions, enabling state-of-the-art performance on long-term action anticipation tasks.