ActNetFormer: A Transformer-ResNet Hybrid Approach for Efficient Semi-Supervised Action Recognition in Videos
ActNetFormer leverages both labeled and unlabeled video data, combining cross-architecture pseudo-labeling with contrastive learning to learn robust action representations. The framework integrates a 3D Convolutional Neural Network (3D CNN) and a video transformer (VIT) to capture the spatial and temporal aspects of actions in complementary ways, achieving state-of-the-art performance on semi-supervised video action recognition tasks.
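The cross-architecture pseudo-labeling idea can be sketched as follows: each architecture's confident predictions on unlabeled clips become training targets for the other. This is a minimal NumPy illustration, not the paper's implementation; the function and variable names (`cross_pseudo_labels`, `logits_cnn`, `logits_vit`, the 0.9 confidence threshold) are assumptions for the sketch.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_pseudo_labels(logits_cnn, logits_vit, threshold=0.9):
    """Cross-architecture pseudo-labeling sketch: the 3D CNN's confident
    predictions supervise the video transformer on unlabeled clips, and
    vice versa. Returns (targets, keep_mask) pairs for each model."""
    p_cnn, p_vit = softmax(logits_cnn), softmax(logits_vit)
    # Targets for the transformer come from the CNN branch...
    targets_for_vit = p_cnn.argmax(axis=1)
    mask_for_vit = p_cnn.max(axis=1) >= threshold  # keep confident clips only
    # ...and targets for the CNN come from the transformer branch.
    targets_for_cnn = p_vit.argmax(axis=1)
    mask_for_cnn = p_vit.max(axis=1) >= threshold
    return (targets_for_vit, mask_for_vit), (targets_for_cnn, mask_for_cnn)

# Toy logits for two unlabeled clips over three action classes:
# the first row is a confident prediction, the second is not.
logits_cnn = np.array([[5.0, 0.0, 0.0], [0.1, 0.0, 0.0]])
logits_vit = np.array([[0.0, 4.0, 0.0], [0.0, 0.0, 3.0]])
(vit_targets, vit_mask), (cnn_targets, cnn_mask) = cross_pseudo_labels(
    logits_cnn, logits_vit)
print(vit_targets, vit_mask)  # → [0 0] [ True False]
```

In the full framework these pseudo-labels would feed a standard cross-entropy term (masked by confidence), alongside the contrastive objective that aligns the two architectures' representations.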