The paper presents VideoMambaPro, an efficient alternative to transformer models for video understanding tasks. The authors first analyze the differences between self-attention in transformers and the token processing in Mamba models. They identify two key limitations of Mamba models when applied to video understanding: historical decay and element contradiction.
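To make the contrast concrete, the sketch below is an illustrative assumption rather than the paper's implementation: it builds the implicit token-mixing matrix of a simplified one-dimensional selective state-space recurrence, h_t = a_t·h_{t-1} + b_t·x_t, y_t = c_t·h_t, next to a softmax attention matrix. The product of gates a_{s+1}···a_t shrinks as the distance t−s grows, which is the historical-decay effect the analysis refers to, whereas attention can weight any position freely.

```python
# Minimal sketch (not the paper's code): token-mixing matrices of softmax
# self-attention vs. a simplified scalar selective SSM (Mamba-style recurrence).
import numpy as np

rng = np.random.default_rng(0)
T = 8  # sequence length

# --- Self-attention: any token can attend to any other with arbitrary weight.
q = rng.normal(size=(T, 4))
k = rng.normal(size=(T, 4))
attn = np.exp(q @ k.T / 2.0)
attn /= attn.sum(axis=1, keepdims=True)      # rows sum to 1, no built-in positional decay

# --- Simplified Mamba-style recurrence: h_t = a_t*h_{t-1} + b_t*x_t, y_t = c_t*h_t.
a = rng.uniform(0.5, 0.95, size=T)           # |a_t| < 1, as in a stable discretized SSM
b = rng.normal(size=T)
c = rng.normal(size=T)

# Unrolling the recurrence yields an implicit lower-triangular mixing matrix:
#   M[t, s] = c_t * (a_{s+1} * ... * a_t) * b_s   for s <= t
M = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        M[t, s] = c[t] * np.prod(a[s + 1:t + 1]) * b[s]

# The product of a's shrinks with t - s: contributions from early tokens fade,
# which is the "historical decay" limitation identified by the authors.
print(np.round(np.abs(M[-1]), 3))            # weights of the last token on earlier ones
```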
To address historical decay, the authors propose masked backward computation in the bi-directional Mamba process, which removes the duplicated self-similarity on the diagonal without affecting other elements. To tackle element contradiction, they add residual connections to Mamba's matrix elements, distributing the demands placed on a single transition parameter A_i across multiple A_i values and thereby avoiding the contradictions caused by interleaved sequence structures. A hedged sketch of the masked backward computation follows below.
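The sketch below illustrates the masked-backward idea under the same simplified scalar recurrence as above (an assumption for illustration; the paper's state matrices are higher-dimensional, and the residual-connection fix is not shown). Naively summing forward and backward scans counts each token's self-similarity twice on the diagonal; zeroing the diagonal of the backward branch removes the duplicate while leaving all cross-token terms untouched.

```python
# Minimal sketch (assumptions, not the authors' code) of masked backward
# computation in a bi-directional scan: the backward branch drops each token's
# own (diagonal) contribution so self-similarity is not counted twice.
import numpy as np

def ssm_mixing_matrix(a, b, c):
    """Implicit lower-triangular mixing matrix of h_t = a_t*h_{t-1} + b_t*x_t, y_t = c_t*h_t."""
    T = len(a)
    M = np.zeros((T, T))
    for t in range(T):
        for s in range(t + 1):
            M[t, s] = c[t] * np.prod(a[s + 1:t + 1]) * b[s]
    return M

rng = np.random.default_rng(1)
T = 6
a = rng.uniform(0.5, 0.95, size=T)
b = rng.normal(size=T)
c = rng.normal(size=T)

M_fwd = ssm_mixing_matrix(a, b, c)                                # forward scan: mixes tokens s <= t
M_bwd = ssm_mixing_matrix(a[::-1], b[::-1], c[::-1])[::-1, ::-1]  # backward scan: mixes tokens s >= t

# Naive bi-directional combination counts the diagonal (token-to-itself) term twice.
M_naive = M_fwd + M_bwd

# Masked backward computation: zero only the diagonal of the backward branch,
# leaving every off-diagonal (cross-token) element unchanged.
M_masked = M_fwd + (M_bwd - np.diag(np.diag(M_bwd)))

print(np.diag(M_naive))    # duplicated self terms
print(np.diag(M_masked))   # single self term per token; off-diagonals are identical
```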
The resulting VideoMambaPro framework builds upon the VideoMamba architecture and consistently outperforms the original VideoMamba model on video action recognition benchmarks, including Kinetics-400, Something-Something V2, UCF-101, and HMDB51. Compared to state-of-the-art transformer models, VideoMambaPro achieves competitive or superior performance, while being significantly more efficient in terms of parameters and FLOPs. For example, on Kinetics-400, VideoMambaPro-M achieves 91.9% top-1 accuracy, only 0.2% below the recent InternVideo2-6B model, but with only 1.2% of the parameters.
The authors conclude that the combination of high performance and efficiency makes VideoMambaPro a promising alternative to transformer models for video understanding tasks.