Core Concepts
Large language models can be effectively leveraged as powerful action recognizers by projecting skeleton sequences into "action sentences" that are compatible with the models' pre-trained knowledge.
Abstract
The paper proposes a novel framework called LLM-AR that treats large language models as action recognizers for skeleton-based human action recognition. The key aspects are:
- Linguistic Projection Process:
- An action-based VQ-VAE model projects each input skeleton sequence into a sequence of discrete tokens, forming an "action sentence" (a tokenization sketch follows this list).
- A human-inductive-biases-guided learning strategy encourages the "action sentences" to more closely resemble human language sentences.
- A hyperbolic codebook is used in the VQ-VAE model to better represent the tree-like structure of human skeletons.
- Large Language Model Integration:
- A pre-trained large language model (e.g., LLaMA) serves as the action recognizer, with its pre-trained weights kept frozen (untouched).
- Low-rank adaptation (LoRA) is applied so that the large language model can understand the projected "action sentences" (a LoRA sketch follows below).
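To ground the projection step, here is a minimal sketch of how a skeleton sequence could be tokenized into an "action sentence" with a hyperbolic codebook. Everything here is an assumption for illustration (the tensor shapes, the `poincare_distance` and `tokenize` helpers, the toy dimensions), not the paper's implementation; the only idea taken from the paper is that nearest-code lookup happens in the Poincaré ball rather than in Euclidean space.

```python
# Hypothetical sketch: VQ-VAE-style tokenization with a hyperbolic codebook.
import torch

def poincare_distance(x, y, eps=1e-6):
    """Geodesic distance in the Poincare ball (curvature -1)."""
    sq_dist = torch.sum((x - y) ** 2, dim=-1)
    nx = torch.clamp(1.0 - torch.sum(x ** 2, dim=-1), min=eps)
    ny = torch.clamp(1.0 - torch.sum(y ** 2, dim=-1), min=eps)
    return torch.acosh(1.0 + 2.0 * sq_dist / (nx * ny))

def tokenize(features, codebook):
    """Map per-frame encoder features (T, D) to discrete token ids (T,),
    i.e., the "action sentence"."""
    # Pairwise hyperbolic distances between each frame feature and each code.
    d = poincare_distance(features.unsqueeze(1), codebook.unsqueeze(0))  # (T, K)
    return d.argmin(dim=-1)

# Toy usage: 16 frames of 64-dim features, a 512-entry codebook.
# The small scale keeps all points inside the unit (Poincare) ball.
features = torch.randn(16, 64) * 0.05
codebook = torch.randn(512, 64) * 0.05
action_sentence = tokenize(features, codebook)  # sequence of discrete token ids
```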
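For the adaptation step, the sketch below shows the generic LoRA mechanism: the pre-trained weight stays frozen and only a low-rank residual is trained. The rank and scaling values are assumed defaults, and wrapping a single `nn.Linear` is purely illustrative; in practice LoRA is typically applied to the attention projections of the pre-trained LLM (e.g., LLaMA).

```python
# Generic LoRA layer (a sketch, not the paper's training code).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        # Keep the pre-trained weights untouched, preserving pre-learned knowledge.
        for p in self.base.parameters():
            p.requires_grad_(False)
        # Trainable low-rank factors; B starts at zero, so the layer initially
        # behaves exactly like the frozen base layer.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus trainable low-rank residual.
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

# Toy usage: wrap one projection layer of a transformer block.
layer = LoRALinear(nn.Linear(4096, 4096))
out = layer(torch.randn(2, 4096))
```

Because only `A` and `B` receive gradients, the number of trainable parameters stays tiny relative to the frozen model.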
The proposed LLM-AR framework consistently achieves state-of-the-art performance on multiple skeleton-based action recognition benchmarks, including NTU RGB+D, NTU RGB+D 120, Toyota Smarthome, and UAV-Human. It also shows promising results when recognizing actions from unseen classes, leveraging the rich knowledge pre-learned by the large language model.
Stats
The NTU RGB+D dataset contains around 56k skeleton sequences from 60 activity classes.
The NTU RGB+D 120 dataset contains more than 114k skeleton sequences across 120 activity classes.
The Toyota Smarthome dataset contains 16,115 video samples over 31 activity classes.
The UAV-Human dataset contains more than 20k skeleton sequences over 155 activity classes.
Quotes
"Motivated by this, in this work we are wondering, if we can also treat the large language model as an action recognizer in skeleton-based human action recognition?"
"Taking this into account, in this work, we aim to harness the large language model as an action recognizer, and at the same time keep its pre-trained weights untouched to preserve its pre-learned rich knowledge."