The proposed D2ST-Adapter is an adapter-tuning framework that efficiently adapts large pre-trained vision models to few-shot action recognition. It encodes spatial and temporal features in a disentangled manner using anisotropic deformable spatio-temporal attention.
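To make the adapter-tuning idea concrete, here is a minimal sketch of a generic residual bottleneck adapter in plain Python. It is not the D2ST-Adapter itself: the disentangled spatial and temporal encoding and the deformable attention are omitted, and the names `adapter`, `w_down`, `w_up_zero`, the feature values, and the zero initialization of the up-projection are assumptions made for illustration only.

```python
def matvec(w, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, a) for a in v]


def adapter(x, w_down, w_up):
    """Generic residual bottleneck adapter: x + W_up * relu(W_down * x).

    Only w_down and w_up would be trained; the backbone that
    produced x stays frozen.
    """
    hidden = relu(matvec(w_down, x))   # project down to a small bottleneck
    delta = matvec(w_up, hidden)       # project back up to the feature width
    return [xi + di for xi, di in zip(x, delta)]


# Hypothetical 4-dimensional feature from a frozen backbone.
feature = [0.5, -1.0, 2.0, 0.25]

# Hypothetical adapter weights: 4 -> 2 -> 4.
w_down = [[0.1, 0.2, -0.3, 0.4],
          [-0.5, 0.1, 0.2, 0.0]]
w_up_zero = [[0.0, 0.0] for _ in range(4)]  # assumed zero initialization

# With a zero up-projection the adapter leaves the feature unchanged.
adapted = adapter(feature, w_down, w_up_zero)
```

Under the assumed zero initialization, the adapter starts out as the identity, so the frozen model's behavior is preserved until the small number of adapter weights are trained.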
The proposed Multi-Velocity Progressive-Alignment (MVP-shot) framework learns and aligns action features at multiple velocities, which makes few-shot action recognition more robust and accurate. It is reported to outperform existing state-of-the-art methods.
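As a rough intuition for what "multi-velocity" can mean, the sketch below builds views of one clip at several temporal strides. The summary does not say how MVP-shot obtains its multi-velocity features, so temporal striding is an assumption here, and the names `sample_at_velocity`, `multi_velocity_views`, and the stride values are hypothetical.

```python
def sample_at_velocity(num_frames, stride):
    """Return the frame indices kept when stepping through a clip with the given stride."""
    return list(range(0, num_frames, stride))


def multi_velocity_views(num_frames, strides=(1, 2, 4)):
    """Map each stride (an assumed stand-in for 'velocity') to its sampled frame indices."""
    return {stride: sample_at_velocity(num_frames, stride) for stride in strides}


# Hypothetical 8-frame clip viewed at three assumed velocities.
views = multi_velocity_views(8)
```

A larger stride skips more frames, so the same action appears to move faster. A model that compares clips across several such views is less sensitive to differences in how quickly an action is performed.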