ActNetFormer learns robust action representations from both labeled and unlabeled video data by combining cross-architecture pseudo-labeling with contrastive learning. The framework pairs 3D Convolutional Neural Networks (3D CNNs) with video transformers (ViT) to capture complementary spatial and temporal aspects of actions, achieving state-of-the-art performance on semi-supervised video action recognition benchmarks.
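As a rough illustration of how cross-architecture pseudo-labeling and a contrastive objective can be combined, the sketch below assumes two backbones (a 3D CNN and a video transformer) that each map a clip tensor to a pair of (logits, embedding). All names and hyperparameters (`cross_architecture_step`, `tau`, `threshold`) are illustrative placeholders, not taken from the ActNetFormer paper.

```python
# Hedged sketch: cross-architecture pseudo-labeling + contrastive alignment.
# cnn_model and vit_model are assumed to return (logits, embedding) for an
# input clip batch of shape (B, C, T, H, W); this is not the paper's exact code.
import torch
import torch.nn.functional as F

def cross_architecture_step(cnn_model, vit_model, labeled_clips, labels,
                            unlabeled_clips, tau=0.5, threshold=0.95):
    # Supervised loss on the labeled batch for both architectures.
    cnn_logits_l, _ = cnn_model(labeled_clips)
    vit_logits_l, _ = vit_model(labeled_clips)
    sup_loss = F.cross_entropy(cnn_logits_l, labels) + F.cross_entropy(vit_logits_l, labels)

    # Each architecture generates pseudo-labels for the unlabeled clips.
    with torch.no_grad():
        cnn_probs = F.softmax(cnn_model(unlabeled_clips)[0], dim=1)
        vit_probs = F.softmax(vit_model(unlabeled_clips)[0], dim=1)
    cnn_conf, cnn_pl = cnn_probs.max(dim=1)
    vit_conf, vit_pl = vit_probs.max(dim=1)

    cnn_logits_u, cnn_emb = cnn_model(unlabeled_clips)
    vit_logits_u, vit_emb = vit_model(unlabeled_clips)

    # Confidence-filtered cross pseudo-label losses: the CNN teaches the
    # transformer and vice versa, keeping only confident predictions.
    mask_cnn = (cnn_conf > threshold).float()
    mask_vit = (vit_conf > threshold).float()
    pl_loss = (F.cross_entropy(vit_logits_u, cnn_pl, reduction="none") * mask_cnn).sum() / mask_cnn.sum().clamp(min=1.0)
    pl_loss = pl_loss + (F.cross_entropy(cnn_logits_u, vit_pl, reduction="none") * mask_vit).sum() / mask_vit.sum().clamp(min=1.0)

    # Contrastive term pulling the two architectures' embeddings of the same
    # clip together while pushing apart embeddings of different clips.
    z1 = F.normalize(cnn_emb, dim=1)
    z2 = F.normalize(vit_emb, dim=1)
    logits = z1 @ z2.t() / tau
    targets = torch.arange(z1.size(0), device=z1.device)
    con_loss = F.cross_entropy(logits, targets)

    return sup_loss + pl_loss + con_loss
```

The weighting of the three terms and the confidence filtering strategy are design choices; the actual framework may combine them differently.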
LaIAR is a framework that leverages knowledge from language models to enhance both the recognition accuracy and the interpretability of video models.
SNRO, a framework for video class-incremental learning, slightly shifts the features of new classes during training to substantially improve performance on old classes while consuming the same memory as existing methods.
UNITE introduces a novel approach to unsupervised video domain adaptation, leveraging masked pre-training and collaborative self-training to achieve significant performance improvements across domains.
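To make the masked pre-training stage concrete, the following is a generic MAE-style masked-modeling sketch over spatio-temporal patch tokens of unlabeled target-domain clips. The `encoder`, `decoder`, and reconstruction objective here are hypothetical placeholders; UNITE's actual pre-training objective and modules may differ, and the collaborative self-training stage is not shown.

```python
# Hedged sketch: generic masked pre-training on unlabeled target-domain video,
# operating on pre-computed patch tokens of shape (B, N, D). Not the paper's code.
import torch
import torch.nn.functional as F

def masked_pretrain_step(encoder, decoder, patch_tokens, mask_ratio=0.75):
    B, N, D = patch_tokens.shape
    num_keep = int(N * (1.0 - mask_ratio))

    # Randomly keep a subset of tokens per clip; the rest are masked out.
    noise = torch.rand(B, N, device=patch_tokens.device)
    keep_idx = noise.argsort(dim=1)[:, :num_keep]
    visible = torch.gather(patch_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))

    # Encode visible tokens only; the decoder predicts the full token sequence.
    latent = encoder(visible)                     # (B, num_keep, D), assumed shape
    reconstructed = decoder(latent, keep_idx, N)  # (B, N, D), assumed signature

    # Reconstruction loss computed on masked positions only.
    mask = torch.ones(B, N, device=patch_tokens.device)
    mask.scatter_(1, keep_idx, 0.0)
    per_token = F.mse_loss(reconstructed, patch_tokens, reduction="none").mean(dim=-1)
    return (per_token * mask).sum() / mask.sum()
```

After this pre-training, a self-training loop would fine-tune on labeled source videos and iteratively pseudo-label confident target clips, analogous to the pseudo-labeling pattern sketched earlier.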