This paper introduces LSTM CrossRWKV (LCR), a novel deep learning architecture for video action recognition that combines the strengths of LSTM networks for temporal modeling with a novel Cross RWKV gate for efficient integration of spatial and temporal information. By addressing the computational cost and long-range dependency limitations of conventional CNN- and Transformer-based methods, LCR offers an efficient, scalable framework for video understanding and achieves competitive performance on multiple benchmark datasets with reduced computational complexity.
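The summary above names two components: an LSTM for temporal modeling and a Cross RWKV gate that fuses spatial and temporal information. The paper's exact formulation is not given here, so the PyTorch sketch below is only one plausible reading: an LSTM tracks a per-frame summary over time, and an RWKV-style receptance-gated linear attention fuses each frame's spatial tokens with the LSTM state. All module names, shapes, and the pooling scheme are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the LSTM + Cross RWKV idea; not the paper's code.
import torch
import torch.nn as nn

class CrossRWKVGate(nn.Module):
    """Gated linear-attention fusion of spatial tokens with a temporal query."""
    def __init__(self, dim: int):
        super().__init__()
        self.receptance = nn.Linear(dim, dim)   # R: gate derived from temporal state
        self.key = nn.Linear(dim, dim)          # K: from spatial tokens
        self.value = nn.Linear(dim, dim)        # V: from spatial tokens
        self.out = nn.Linear(dim, dim)

    def forward(self, spatial: torch.Tensor, temporal: torch.Tensor) -> torch.Tensor:
        # spatial: (B, N, D) patch tokens of one frame; temporal: (B, D) LSTM state.
        r = torch.sigmoid(self.receptance(temporal)).unsqueeze(1)  # (B, 1, D) gate
        k = self.key(spatial).softmax(dim=1)                       # normalize over tokens
        v = self.value(spatial)
        wkv = (k * v).sum(dim=1, keepdim=True)                     # (B, 1, D) linear attention
        return self.out(r * wkv).squeeze(1)                        # gated readout, (B, D)

class LCRBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.gate = CrossRWKVGate(dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, N, D) -- T frames, N spatial tokens of width D.
        B, T, N, D = frames.shape
        pooled = frames.mean(dim=2)                  # (B, T, D) per-frame summary
        states, _ = self.lstm(pooled)                # temporal modeling over frames
        fused = [self.gate(frames[:, t], states[:, t]) for t in range(T)]
        return torch.stack(fused, dim=1)             # (B, T, D) fused features

x = torch.randn(2, 8, 196, 256)                      # 2 clips, 8 frames, 14x14 patches
print(LCRBlock(256)(x).shape)                        # torch.Size([2, 8, 256])
```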
VideoMambaPro, an efficient alternative to transformer models, addresses the limitations of Mamba in video understanding tasks through masked backward computation and elemental residual connections, achieving state-of-the-art performance on video benchmarks.
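As a rough illustration of the two mechanisms named above, the toy sketch below applies them to a heavily simplified bidirectional linear scan (not Mamba's selective scan): the backward direction skips each token's own contribution (masked backward computation), and the input is added back element-wise (an elemental residual connection). Both the recurrence and the placement of the residual are assumptions made for illustration only.

```python
# Toy linear recurrence standing in for a Mamba-style scan; illustrative only.
import torch

def scan(x: torch.Tensor, decay: float, mask_self: bool) -> torch.Tensor:
    # x: (T, D). State update h_t = decay * h_{t-1} + x_t; when mask_self is
    # True, the output at step t excludes the token's own contribution x_t.
    h = torch.zeros(x.shape[1])
    out = []
    for t in range(x.shape[0]):
        out.append(h if mask_self else decay * h + x[t])
        h = decay * h + x[t]
    return torch.stack(out)

def bidirectional_block(x: torch.Tensor, decay: float = 0.9) -> torch.Tensor:
    fwd = scan(x, decay, mask_self=False)                  # forward scan keeps the token
    bwd = scan(x.flip(0), decay, mask_self=True).flip(0)   # backward scan masks it out
    return fwd + bwd + x                                   # elemental residual connection

print(bidirectional_block(torch.randn(8, 4)).shape)        # torch.Size([8, 4])
```

Masking the backward direction avoids counting each token twice across the two scans; the residual then reinstates the token's raw contribution exactly once.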
TC-CLIP effectively and efficiently leverages comprehensive video information by extracting core information from each frame, interconnecting relevant information across the video to summarize it into context tokens, and utilizing these context tokens during the feature encoding process. In addition, a Video-conditional Prompting (VP) module employs the context tokens to generate informative prompts in the text modality.
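A minimal sketch of the two stages described above, under stated assumptions: token saliency is proxied by feature norm, the selected core tokens are pooled into a fixed number of context tokens, and learnable text-side prompts cross-attend to them in the VP step. None of these design choices are taken from the paper; they only make the data flow concrete.

```python
# Hypothetical sketch of context-token extraction and Video-conditional Prompting.
import torch
import torch.nn as nn

def context_tokens(frame_tokens: torch.Tensor, k: int = 4, num_ctx: int = 16):
    # frame_tokens: (T, N, D). Keep the k highest-norm tokens per frame as
    # "core information", then average-pool groups into num_ctx context tokens.
    T, N, D = frame_tokens.shape
    scores = frame_tokens.norm(dim=-1)                     # (T, N) saliency proxy
    idx = scores.topk(k, dim=1).indices                    # (T, k) core-token indices
    core = torch.gather(frame_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))
    core = core.reshape(T * k, D)                          # pool across the whole video
    return core.reshape(num_ctx, -1, D).mean(dim=1)        # (num_ctx, D)

class VideoConditionalPrompting(nn.Module):
    def __init__(self, dim: int, num_prompts: int = 8):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, ctx: torch.Tensor) -> torch.Tensor:
        # ctx: (num_ctx, D) video context tokens; learnable prompts attend to them.
        q = self.prompts.unsqueeze(0)                      # (1, P, D)
        kv = ctx.unsqueeze(0)                              # (1, C, D)
        out, _ = self.attn(q, kv, kv)
        return self.prompts + out.squeeze(0)               # video-conditioned prompts

tokens = torch.randn(8, 196, 512)                          # 8 frames, 196 patch tokens
ctx = context_tokens(tokens)                               # (16, 512) context tokens
print(VideoConditionalPrompting(512)(ctx).shape)           # torch.Size([8, 512])
```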