Yin, Z., Li, C., & Dong, X. (2024). Video RWKV: Video Action Recognition Based RWKV (preprint). arXiv:2411.05636v1 [cs.CV].
This paper introduces LSTM CrossRWKV (LCR), a deep learning model designed to address two central challenges in video action recognition: high computational cost and the difficulty of capturing long-distance dependencies.
The proposed LCR framework integrates an LSTM architecture with Cross RWKV blocks for spatiotemporal representation learning. A Cross RWKV gate fuses past temporal information with edge information from the current frame, sharpening the model's focus on the subject while aggregating inter-frame features globally over time. Edge information additionally drives the LSTM's forgetting gate, guiding long-term memory management, as sketched below. The researchers evaluate LCR on three benchmark datasets for human action recognition: Kinetics-400, Something-Something V2, and Jester.
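To make the gating structure concrete, here is a minimal PyTorch sketch of an LCR-style cell under one reading of the description above: the forget gate is computed from the current frame's edge features, and a cross gate blends the past hidden state with those edge cues. All module and variable names (`LCRCell`, `edge_forget`, `cross_gate`) are illustrative assumptions, and the paper's actual Cross RWKV block involves RWKV time-mixing, which is elided here.

```python
import torch
import torch.nn as nn

class LCRCell(nn.Module):
    """LSTM-style cell with an edge-driven forget gate and a cross gate (sketch)."""
    def __init__(self, dim: int):
        super().__init__()
        self.in_gate = nn.Linear(2 * dim, dim)      # input gate from [frame, h]
        self.cand = nn.Linear(2 * dim, dim)         # candidate cell update
        self.edge_forget = nn.Linear(dim, dim)      # forget gate from edge features
        self.cross_gate = nn.Linear(2 * dim, dim)   # fuses past h with edge info

    def forward(self, frame_feat, edge_feat, h, c):
        # frame_feat, edge_feat, h, c: (batch, dim)
        x = torch.cat([frame_feat, h], dim=-1)
        i = torch.sigmoid(self.in_gate(x))              # how much new content to write
        g = torch.tanh(self.cand(x))                    # proposed cell update
        f = torch.sigmoid(self.edge_forget(edge_feat))  # edge-guided forgetting
        c = f * c + i * g                               # long-term memory update
        # cross gate: blend the past temporal state with current edge cues
        r = torch.sigmoid(self.cross_gate(torch.cat([h, edge_feat], dim=-1)))
        h = r * torch.tanh(c) + (1 - r) * edge_feat
        return h, c

# Usage: roll the cell over a clip of T per-frame feature vectors.
dim, T, B = 256, 16, 4
cell = LCRCell(dim)
h = c = torch.zeros(B, dim)
frames = torch.randn(B, T, dim)   # per-frame features, e.g. from a backbone
edges = torch.randn(B, T, dim)    # per-frame edge features
for t in range(T):
    h, c = cell(frames[:, t], edges[:, t], h, c)
```

Note the design choice this illustrates: because the forget gate depends only on edge features rather than on the hidden state, the edge map acts as an external prompt that decides what long-term memory to keep at each step.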
The study presents LCR as a scalable and efficient solution for video action recognition: by combining the LSTM with Cross RWKV and using edge information as a guiding prompt, the model captures spatiotemporal dependencies while remaining computationally efficient. A hedged example of such per-frame edge extraction follows.
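The summary above does not pin down the edge extractor the authors use; a fixed Sobel filter is a common, cheap choice and is assumed here purely for illustration.

```python
import torch
import torch.nn.functional as F

def sobel_edges(frames: torch.Tensor) -> torch.Tensor:
    """frames: (batch, T, 1, H, W) grayscale clip -> per-frame edge magnitude maps."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    ky = kx.t()                                      # Sobel y is the transpose of Sobel x
    k = torch.stack([kx, ky]).unsqueeze(1)           # (2, 1, 3, 3) conv weights
    b, t, c, h, w = frames.shape
    x = frames.reshape(b * t, c, h, w)               # fold time into the batch dim
    g = F.conv2d(x, k.to(frames), padding=1)         # gradients along x and y
    mag = torch.sqrt((g ** 2).sum(dim=1, keepdim=True) + 1e-8)
    return mag.reshape(b, t, 1, h, w)
```

The resulting edge maps would then be embedded to the cell's feature dimension before being fed to the gates, a projection step omitted here.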
This research contributes to the field of computer vision by introducing a novel architecture for video understanding that addresses the limitations of existing methods in terms of computational complexity and long-range dependency modeling. The proposed LCR model offers a promising direction for developing more efficient and accurate video analysis tools.
The authors acknowledge that the classical LSTM structure limits the model's capacity for parallel computation. Future research could scale the model to larger networks and investigate its application to video prediction and generation tasks.