Core Concepts
Feature prediction can serve as an effective stand-alone objective for unsupervised learning of versatile visual representations from video, outperforming pixel-level reconstruction approaches.
Summary
The paper explores feature prediction as a stand-alone objective for unsupervised learning of visual representations from video. It introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision.
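The feature-prediction objective can be illustrated with a minimal NumPy sketch. This is a toy illustration of the data flow only: the real V-JEPA uses Vision Transformer encoders and a narrow transformer predictor, while the linear maps, dimensions, and function names below are hypothetical stand-ins. The core ideas it mirrors are that the context encoder sees only visible patches, the predictor regresses the features of masked patches, and the targets come from an exponential-moving-average (EMA) copy of the encoder whose output is treated as a constant (stop-gradient).

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8                               # toy feature dimension
W_online = rng.normal(size=(D, D))  # online (context) encoder, trained
W_target = W_online.copy()          # target encoder: EMA copy, not trained
W_pred = rng.normal(size=(D, D))    # predictor weights

def l1_feature_loss(video_patches, mask):
    """video_patches: (N, D) patch tokens; mask: boolean (N,), True = hidden.

    Returns the L1 regression loss between predicted and target features
    of the masked patches, as in the V-JEPA objective.
    """
    # 1. The context encoder sees only the visible (unmasked) patches.
    ctx = video_patches[~mask] @ W_online
    # 2. The predictor maps context features to one prediction per masked
    #    token (here: a crude mean-of-context summary, purely illustrative).
    pred = np.tile(ctx.mean(axis=0), (int(mask.sum()), 1)) @ W_pred
    # 3. The target encoder processes the full video; only the masked
    #    tokens are kept, and no gradient flows into W_target.
    tgt = (video_patches @ W_target)[mask]
    # 4. L1 distance in feature space.
    return np.abs(pred - tgt).mean()

def ema_update(w_target, w_online, momentum=0.99):
    """Target weights slowly track the online weights via an EMA."""
    return momentum * w_target + (1 - momentum) * w_online
```

In a training loop, `W_online` and `W_pred` would be updated by gradient descent on the loss, followed by `W_target = ema_update(W_target, W_online)` each step; the EMA target plus stop-gradient is what prevents the trivial collapsed solution where all features become constant.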
The key highlights and insights are:
Feature prediction leads to versatile visual representations that perform well across downstream image and video tasks without adapting the model's weights (i.e., using a frozen backbone). V-JEPA achieves the best performance among the methods considered on the motion-based Something-Something-v2 task, while also being competitive on appearance-based tasks like Kinetics-400.
Models trained with feature prediction are superior to pixel prediction approaches under a frozen evaluation protocol and are competitive with pixel prediction under full fine-tuning, while using significantly shorter training schedules.
Models trained with feature prediction are more label-efficient than pixel prediction approaches: as the number of labeled examples available for downstream evaluation decreases, the performance gap between V-JEPA and pixel-reconstruction models widens.
The paper conducts extensive ablations to identify key design choices, including the benefits of predicting in feature space versus pixel space, the impact of pretraining data distribution, the effectiveness of attentive pooling, and the importance of the masking strategy.
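The attentive pooling studied in the ablations can be sketched as single-query cross-attention over the frozen backbone's patch features: a learnable query scores every patch token and the output is a softmax-weighted average, rather than a uniform mean. The parameter names and the single-head, single-query form below are simplifying assumptions for illustration; only the probe's parameters would be trained, with the backbone kept frozen.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy feature dimension

# Hypothetical learnable parameters of the attentive probe; the frozen
# backbone that produces the input features is not shown here.
query = rng.normal(size=D)     # learnable pooling query
Wk = rng.normal(size=(D, D))   # key projection
Wv = rng.normal(size=(D, D))   # value projection

def attentive_pool(features):
    """Pool (N, D) frozen patch features into a single (D,) vector via
    single-query cross-attention (softmax over learned key scores)."""
    keys = features @ Wk
    vals = features @ Wv
    scores = keys @ query / np.sqrt(D)   # scaled dot-product scores
    w = np.exp(scores - scores.max())
    w /= w.sum()                         # softmax attention weights
    return w @ vals                      # weighted average of values
```

Because the weights come from a softmax over per-token scores, the pooled vector is invariant to patch ordering, and informative tokens can dominate the average instead of being diluted by a uniform mean.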
Qualitative analysis shows that the V-JEPA predictor network is able to generate consistent and plausible predictions for the masked regions of the video, demonstrating that the feature-space predictions are grounded in the visual input.
Quotes
"Feature prediction can serve as an effective stand-alone objective for unsupervised learning of versatile visual representations from video, outperforming pixel-level reconstruction approaches."
"V-JEPA achieves the best performance among the methods considered on the motion-based Something-Something-v2 task, while also being competitive on appearance-based tasks like Kinetics-400."
"Models trained with feature prediction are superior to pixel prediction approaches under a frozen evaluation protocol and are competitive with pixel prediction under full fine-tuning, while using significantly shorter training schedules."