ActNetFormer: A Transformer-ResNet Hybrid Approach for Efficient Semi-Supervised Action Recognition in Videos
ActNetFormer leverages both labeled and unlabeled video data, combining cross-architecture pseudo-labeling with contrastive learning to learn robust action representations. The framework integrates a 3D Convolutional Neural Network (3D CNN) and a video transformer (VIT) to capture the spatial and temporal aspects of actions in complementary ways, achieving state-of-the-art performance on semi-supervised video action recognition tasks.
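The cross-architecture pseudo-labeling idea can be sketched as follows: each architecture's confident predictions on unlabeled clips become training targets for the other. This is a minimal NumPy illustration, not the paper's implementation; the function and variable names (`cross_pseudo_labels`, `logits_cnn`, `logits_vit`, the 0.9 confidence threshold) are assumptions for the sketch.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_pseudo_labels(logits_cnn, logits_vit, threshold=0.9):
    """Cross-architecture pseudo-labeling sketch: the 3D CNN's confident
    predictions supervise the video transformer on unlabeled clips, and
    vice versa. Returns (targets, keep_mask) pairs for each model."""
    p_cnn, p_vit = softmax(logits_cnn), softmax(logits_vit)
    # Targets for the transformer come from the CNN branch...
    targets_for_vit = p_cnn.argmax(axis=1)
    mask_for_vit = p_cnn.max(axis=1) >= threshold  # keep confident clips only
    # ...and targets for the CNN come from the transformer branch.
    targets_for_cnn = p_vit.argmax(axis=1)
    mask_for_cnn = p_vit.max(axis=1) >= threshold
    return (targets_for_vit, mask_for_vit), (targets_for_cnn, mask_for_cnn)

# Toy logits for two unlabeled clips over three action classes:
# the first row is a confident prediction, the second is not.
logits_cnn = np.array([[5.0, 0.0, 0.0], [0.1, 0.0, 0.0]])
logits_vit = np.array([[0.0, 4.0, 0.0], [0.0, 0.0, 3.0]])
(vit_targets, vit_mask), (cnn_targets, cnn_mask) = cross_pseudo_labels(
    logits_cnn, logits_vit)
print(vit_targets, vit_mask)  # → [0 0] [ True False]
```

In the full framework these pseudo-labels would feed a standard cross-entropy term (masked by confidence), alongside the contrastive objective that aligns the two architectures' representations.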