Adapting Short-Term Vision Transformers for Temporal Action Detection in Untrimmed Videos
This paper introduces ViT-TAD, a simple yet effective end-to-end temporal action detection framework that adapts pre-trained short-term Vision Transformers (ViTs) to long-form untrimmed videos by incorporating inner-backbone and post-backbone information propagation modules.
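The core idea — split a long untrimmed video into short snippets, encode each with a short-term backbone, then exchange information across snippets — can be illustrated with a minimal NumPy sketch. Everything here is illustrative: `snippet_vit` is a hypothetical stub standing in for a pre-trained ViT, and the single-head self-attention is only a stand-in for the paper's information propagation modules, not their actual design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def snippet_vit(frames, w):
    # Hypothetical stub for a pre-trained short-term ViT:
    # maps one snippet of frames to a single feature vector.
    return np.tanh(frames.mean(axis=0) @ w)

def cross_snippet_attention(x, wq, wk, wv):
    # Single-head self-attention over snippet features: a minimal
    # stand-in for cross-snippet information propagation.
    q, k, v = x @ wq, x @ wk, x @ wv
    a = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return a @ v

rng = np.random.default_rng(0)
d = 16                              # feature dim (illustrative)
video = rng.normal(size=(64, d))    # 64 "frames" of a long untrimmed video
snippets = video.reshape(8, 8, d)   # 8 snippets of 8 frames each

w = rng.normal(size=(d, d))
feats = np.stack([snippet_vit(s, w) for s in snippets])     # (8, d), per-snippet
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
out = feats + cross_snippet_attention(feats, wq, wk, wv)    # snippet features with video-level context
print(out.shape)
```

A temporal action detection head would then predict action segments from these context-enriched snippet features; in the paper this propagation happens both inside the backbone and after it.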