Core Concepts
Attention Prompt Tuning (APT) enhances parameter efficiency and reduces computational complexity for video-based action recognition.
Summary
Abstract:
APT introduces a computationally efficient variant of prompt tuning for video-based action recognition.
Compared to images, videos require many more tunable prompts to achieve good results.
Introduction:
Video-based action recognition encodes temporal information crucial for identifying human activities.
Transformers have revolutionized various fields, including action recognition.
Method:
APT injects prompts directly into the non-local attention mechanism, reducing redundancy and computational complexity.
Prompt reparameterization technique enhances robustness to hyperparameter selection.
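The two method ideas above can be sketched together: learnable prompts are injected directly into attention as extra key/value pairs (so the token sequence itself is never lengthened), and the prompts are reparameterized through a small MLP during training. This is a minimal illustration, not the paper's implementation; the module name, MLP design, and initialization scale are assumptions.

```python
import torch
import torch.nn as nn


class AttentionWithPrompts(nn.Module):
    """Self-attention with APT-style attention prompts (illustrative sketch)."""

    def __init__(self, dim=384, num_heads=6, num_prompts=200):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Prompt reparameterization (assumed design): prompts are produced
        # from a learnable latent by a small shared MLP, which the paper
        # reports improves robustness to hyperparameter selection.
        self.prompt_latent = nn.Parameter(
            torch.randn(1, 2 * num_prompts, dim) * 0.02
        )
        self.reparam = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, x):
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Generate prompt keys/values and concatenate them to k and v only;
        # queries (and hence the output sequence length) stay at N tokens.
        prompts = self.reparam(self.prompt_latent).expand(B, -1, -1)
        prompt_k, prompt_v = prompts.chunk(2, dim=1)
        k = torch.cat([prompt_k, k], dim=1)
        v = torch.cat([prompt_v, v], dim=1)

        def split(t):  # (B, L, C) -> (B, heads, L, head_dim)
            return t.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

In a parameter-efficient setup, only the prompt latent and the reparameterization MLP would be trained while the backbone weights stay frozen.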
Experimental Setup:
Experiments are conducted using ViT-Small and ViT-Base backbones initialized with VideoMAE pre-trained weights.
Results:
APT achieves superior performance with fewer tunable parameters compared to VPT and AdaptFormer.
Computational Complexity:
APT significantly reduces the number of tunable parameters and GFLOPs compared to VPT.
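The source of the GFLOPs saving can be illustrated with a back-of-the-envelope attention-cost comparison, assuming a VPT-style design prepends prompts to the full token sequence (so they act as both queries and keys/values) while APT adds them only as key/value pairs inside attention. The token and dimension numbers below are illustrative, not taken from the paper.

```python
def attn_flops(n_query, n_key, dim):
    # Cost of QK^T plus the attention-weighted sum of V,
    # ignoring the linear projections (rough estimate).
    return 2 * n_query * n_key * dim


N, n, d = 1568, 200, 384  # video tokens, prompts, embed dim (assumed)
base = attn_flops(N, N, d)          # no prompts
vpt = attn_flops(N + n, N + n, d)   # prompts extend the whole sequence
apt = attn_flops(N, N + n, d)       # prompts extend keys/values only
print(f"VPT-style attention cost: {vpt / base:.2f}x baseline")
print(f"APT-style attention cost: {apt / base:.2f}x baseline")
```

The sequence-lengthening design also pushes the extra tokens through every MLP block, while attention-only injection avoids that cost entirely, which compounds the gap in practice.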
Main Analysis:
APT outperforms existing methods on UCF101, HMDB51, and SSv2 datasets.
Conclusion and Future Work:
APT establishes itself as a state-of-the-art method for parameter-efficient tuning in action recognition.
Statistics
Videos require hundreds of tunable prompts for good results.
APT achieves higher accuracy than full fine-tuning with only 200 attention prompts.
APT reduces the number of tunable parameters required for video-based applications.
Quotes
"Videos require hundreds of tunable prompts to achieve good results."
"APT achieves higher accuracy than full-tuning with fewer tunable parameters."