Core concepts
A novel prompt learning framework with optimal transport that effectively aligns multiple prompts to video features across different temporal scales, enabling robust and accurate few-shot temporal action localization.
Summary
The paper introduces a novel approach to temporal action localization (TAL) in few-shot learning settings. It addresses the limitations of conventional single-prompt learning methods, which often overfit because a single prompt cannot generalize across the varying contexts found in real-world videos.
Key highlights:
- Proposes a multi-prompt learning framework enhanced with optimal transport to capture the diversity of camera views, backgrounds, and objects in videos.
- The multi-prompt design allows the model to learn a set of diverse prompts for each action, capturing general characteristics more effectively and distributing the representation to mitigate overfitting.
- Employs optimal transport theory to efficiently align these prompts with action features, optimizing for a comprehensive representation that adapts to the multifaceted nature of video data (see the sketch after this list).
- Experiments demonstrate significant improvements in action localization accuracy and robustness in few-shot settings on the THUMOS-14 and EPIC-KITCHENS-100 datasets.
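
To make the prompt-to-feature alignment concrete, below is a minimal PyTorch sketch of entropic optimal transport (log-domain Sinkhorn iterations) matching a set of prompt embeddings to a video's temporal segment features. It illustrates the general mechanism only; the function names (`sinkhorn`, `ot_matching_score`), the cosine cost, the uniform marginals, and the regularization value `eps` are illustrative assumptions, not the paper's exact formulation.

```python
import math
import torch
import torch.nn.functional as F

def sinkhorn(cost, n_iters=50, eps=0.1):
    # Entropic-regularized OT (log-domain Sinkhorn) on an (M, T) cost matrix.
    # Returns an (M, T) transport plan with approximately uniform marginals.
    M, T = cost.shape
    log_mu = torch.full((M,), -math.log(M), device=cost.device)  # uniform prompt marginal
    log_nu = torch.full((T,), -math.log(T), device=cost.device)  # uniform segment marginal
    log_K = -cost / eps                                          # Gibbs kernel (log space)
    u = torch.zeros(M, device=cost.device)
    v = torch.zeros(T, device=cost.device)
    for _ in range(n_iters):
        u = log_mu - torch.logsumexp(log_K + v[None, :], dim=1)
        v = log_nu - torch.logsumexp(log_K + u[:, None], dim=0)
    return torch.exp(log_K + u[:, None] + v[None, :])

def ot_matching_score(prompt_emb, video_emb, eps=0.1):
    # prompt_emb: (M, D) learnable prompt embeddings for one action class.
    # video_emb:  (T, D) features of T temporal segments of a video.
    prompt_emb = F.normalize(prompt_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    cost = 1.0 - prompt_emb @ video_emb.t()          # cosine distance, shape (M, T)
    with torch.no_grad():                            # plan treated as fixed during backprop
        plan = sinkhorn(cost, eps=eps)
    # Negative transport cost: higher means the prompt set matches the video better.
    return -(plan * cost).sum()

# Example: score a video against an action class represented by 4 prompts.
video_feats = torch.randn(128, 512)                  # 128 segments, 512-dim features
class_prompts = {"GolfSwing": torch.randn(4, 512, requires_grad=True)}
scores = {c: ot_matching_score(p, video_feats) for c, p in class_prompts.items()}
```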
Stats
Example action description: "The moment a golfer swings and hits a ball in front of a fairway with trees."
The necessity for few-shot learning methods in temporal action localization (TAL) stems from the inherent challenge of annotating video data, as annotators must watch videos in their entirety to accurately label action instances.
Quotes
"Current approaches to few-shot learning for temporal action localization take a meta-learning approach [34, 37], where each test video is aligned to a small subset of the training data in many 'episodes.' These methods require learning a model from initialization, with no priors, consuming large amounts of memory and compute."
"A recent training paradigm, prompt learning, has been used to reduce the number of trainable parameters, where all parameters of the model are fixed, and a learnable context vector is added to the prompt to improve the alignment between the prompt and the image features. [8,15,60,61]."