VideoMAC proposes a new method for video representation learning by combining ConvNets and masked autoencoders. The framework demonstrates superior performance in various downstream tasks compared to existing ViT-based approaches. By introducing reconstruction consistency and utilizing sparse convolution, VideoMAC achieves efficient modeling of spatio-temporal data.
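The masked-modeling objective described above, random patch masking followed by reconstruction scored only on the masked regions, can be sketched minimally as follows. This is a hypothetical single-frame illustration of the idea, not VideoMAC's implementation: the paper masks video clips and uses sparse convolution so the encoder skips masked regions, which this plain NumPy sketch does not reproduce.

```python
import numpy as np

def mask_patches(frame, patch=4, mask_ratio=0.75, seed=0):
    """Zero out a random subset of non-overlapping square patches.

    Hypothetical helper illustrating masked modeling on one 2D frame;
    `patch`, `mask_ratio`, and `seed` are illustrative parameters.
    Returns the masked frame and a boolean patch mask (True = masked).
    """
    rng = np.random.default_rng(seed)
    h, w = frame.shape
    ph, pw = h // patch, w // patch
    n_masked = int(ph * pw * mask_ratio)
    mask = np.zeros(ph * pw, dtype=bool)
    mask[rng.permutation(ph * pw)[:n_masked]] = True
    mask = mask.reshape(ph, pw)
    out = frame.copy()
    for r in range(ph):
        for c in range(pw):
            if mask[r, c]:
                out[r * patch:(r + 1) * patch,
                    c * patch:(c + 1) * patch] = 0.0
    return out, mask

def masked_mse(pred, target, mask, patch=4):
    """Mean squared error restricted to masked patches, the usual
    masked-autoencoder reconstruction objective."""
    errs = []
    for r in range(mask.shape[0]):
        for c in range(mask.shape[1]):
            if mask[r, c]:
                p = pred[r * patch:(r + 1) * patch,
                         c * patch:(c + 1) * patch]
                t = target[r * patch:(r + 1) * patch,
                           c * patch:(c + 1) * patch]
                errs.append(float(np.mean((p - t) ** 2)))
    return float(np.mean(errs))
```

Scoring only the masked patches is what makes the pretext task non-trivial: visible patches could otherwise be copied through unchanged.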
The study highlights the limitations of existing MVM methods based on isotropic ViT designs and emphasizes the benefits of using ConvNets for hierarchical pre-training. VideoMAC's architecture enables the integration of temporal information through an online-target encoder structure, reducing computational complexity while improving performance.
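An online-target encoder pairing of this kind typically keeps the target encoder as an exponential moving average (EMA) of the online encoder, so only one branch receives gradients. A minimal sketch of the standard EMA rule follows; the momentum value is illustrative, not VideoMAC's actual setting:

```python
def ema_update(target_params, online_params, momentum=0.999):
    """Return target parameters moved toward the online parameters.

    Classic online/target EMA rule: target <- m * target + (1 - m) * online.
    The momentum value is illustrative; the paper's schedule may differ.
    """
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_params, online_params)]
```

Because the target branch is updated by this averaging rather than by backpropagation, the scheme supplies a second, temporally consistent view at little extra gradient cost, which is consistent with the reduced computational complexity noted above.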
Ablation studies reveal the impact of different components such as encoder design, decoder depth, masking strategies, data settings, loss functions, and weight factors on the overall performance of VideoMAC. The framework shows promising results in image recognition tasks after pre-training on video data.
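For the loss-function and weight-factor ablations, a common pattern is to combine the masked-reconstruction term with a consistency term between the two branches' reconstructions via a scalar weight. The sketch below shows one hypothetical form of such a weighted objective; the function name, default weight, and exact terms are assumptions, not the paper's definitions:

```python
import numpy as np

def consistency_objective(online_rec, target_rec, target_frames, weight=0.5):
    """Reconstruction loss plus a weighted consistency term between the
    online and target branches' reconstructions.

    Hypothetical form: `weight` is the kind of factor an ablation study
    would sweep; the paper's actual loss may differ.
    """
    recon = float(np.mean((online_rec - target_frames) ** 2))
    consistency = float(np.mean((online_rec - target_rec) ** 2))
    return recon + weight * consistency
```

Sweeping `weight` between 0 (pure reconstruction) and larger values (consistency-dominated) is the sort of trade-off such ablations quantify.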
Overall, VideoMAC presents a compelling alternative to ViT-based methods for video representation learning, showcasing advancements in ConvNet-based MVM approaches.
Key insights extracted from the source content by Gensheng Pei... at arxiv.org, 03-01-2024
https://arxiv.org/pdf/2402.19082.pdf