SFMViT: A Dual-Stream Spatiotemporal Feature Modeling Network for Robust Action Localization in Chaotic Scenes
The proposed SFMViT model combines the global feature-extraction capability of the Vision Transformer (ViT) with the spatiotemporal modeling strengths of SlowFast, achieving state-of-the-art performance for spatiotemporal action localization on the challenging Chaotic World dataset. In addition, the introduced Confidence Pruning Strategy refines anchor selection, further improving model accuracy.
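The abstract does not spell out how the Confidence Pruning Strategy works; a common approach to anchor pruning is to discard candidate anchors whose confidence scores fall below a threshold and keep only the top-scoring remainder. The sketch below illustrates that generic idea; the function name, threshold, and top-k cap are assumptions for illustration, not the paper's actual algorithm.

```python
import numpy as np

def prune_anchors(anchors, scores, conf_threshold=0.5, top_k=10):
    """Hypothetical confidence pruning: drop anchors scored below
    conf_threshold, then keep at most top_k by descending score.
    (Illustrative only; the paper's strategy may differ.)"""
    keep = scores >= conf_threshold
    anchors, scores = anchors[keep], scores[keep]
    order = np.argsort(scores)[::-1][:top_k]
    return anchors[order], scores[order]

# Toy example: three candidate boxes (x1, y1, x2, y2) with scores.
anchors = np.array([[0, 0, 10, 10], [5, 5, 20, 20], [2, 2, 8, 8]])
scores = np.array([0.9, 0.3, 0.7])
kept, kept_scores = prune_anchors(anchors, scores)
print(len(kept))  # 2 anchors survive the 0.5 threshold
```

Pruning low-confidence anchors before downstream classification reduces false positives and computation, which is plausibly why such a step boosts accuracy in cluttered, chaotic scenes.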