Core Concepts
Jointly leveraging high-level semantics and low-level temporal correspondence enhances object-centric perception in videos.
Abstract
This article introduces a novel self-supervised framework that combines semantic discrimination and temporal correspondence to improve object-centric analysis. The model uses RGB feature maps and dense feature correlation, fused into semantic-aware masked slot attention with Gaussian distributions for semantic decomposition and instance identification. The method achieves state-of-the-art results on label propagation tasks and unsupervised object discovery.
Introduction
Humans distinguish objects through high-level semantics and low-level temporal correspondence.
Computer vision aims to equip machines with these capabilities for object-centric perception.
Early works relied on human annotations or weak supervision, limiting generalization ability.
Recent unsupervised methods show promising results in learning robust representations.
Method
Frame-wise features are extracted using a visual encoder from an RGB sequence.
Dense feature correlation is calculated for temporal correspondence representations.
Semantic-aware masked slot attention is used for semantic decomposition and instance identification.
Training involves self-supervision with temporal consistency objectives.
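The two building blocks named above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the temperature value, and the random features are assumptions made for illustration. The slot update is plain slot attention with no learned projections or recurrent update, so it leaves out the semantic-aware masking, the Gaussian distributions, and the fusion of RGB and correlation features that the summary mentions.

```python
import numpy as np

def dense_correlation(feat_a, feat_b, temperature=0.07):
    """Softmax-normalized cosine similarity between all locations of two frames.

    feat_a, feat_b: (N, C) arrays, i.e. flattened H*W feature maps with
    non-zero feature vectors. Returns an (N, N) matrix whose rows sum to 1.
    The temperature is an assumed value, not one taken from the paper.
    """
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def slot_attention_step(slots, feats, eps=1e-8):
    """One simplified slot-attention update.

    slots: (K, C) current slot vectors; feats: (N, C) frame features.
    Returns the updated slots (K, C) and the attention map (N, K).
    """
    scale = feats.shape[1] ** -0.5
    logits = feats @ slots.T * scale             # (N, K)
    logits -= logits.max(axis=1, keepdims=True)
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)      # softmax over slots: slots compete per location
    weights = attn / (attn.sum(axis=0, keepdims=True) + eps)  # normalize per slot
    new_slots = weights.T @ feats                # weighted mean of features per slot
    return new_slots, attn

# Hypothetical usage with random features standing in for encoder outputs.
rng = np.random.default_rng(0)
feat_t = rng.normal(size=(16, 8))    # frame t, 16 locations, 8 channels
feat_t1 = rng.normal(size=(16, 8))   # frame t+1
corr = dense_correlation(feat_t, feat_t1)

slots = rng.normal(size=(3, 8))      # 3 slots
for _ in range(3):
    slots, attn = slot_attention_step(slots, feat_t)
masks = attn.argmax(axis=1)          # hard assignment of each location to a slot
```

Normalizing the attention over slots rather than over locations is what makes the slots compete for each location, which is how a slot-based model arrives at a decomposition of the frame.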
Experiments
The model is trained on the YouTube-VOS dataset and evaluated on several benchmarks for single- and multiple-object discovery.
It also achieves promising results on label propagation tasks, demonstrating the effectiveness of the proposed method.
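For readers unfamiliar with label propagation, the sketch below shows one common way the task is set up: labels from an annotated reference frame are transferred to a target frame through feature affinities. The summary does not describe the evaluation protocol, so the top-k restriction, the temperature, and the function name are assumptions for illustration, not details of the paper.

```python
import numpy as np

def propagate_labels(feat_ref, labels_ref, feat_tgt, top_k=5, temperature=0.07):
    """Transfer labels from a reference frame to a target frame.

    feat_ref: (N, C) reference features; labels_ref: (N, L) one-hot or soft labels;
    feat_tgt: (M, C) target features. Requires top_k <= N.
    Returns (M, L) soft labels for the target locations.
    """
    ref = feat_ref / np.linalg.norm(feat_ref, axis=1, keepdims=True)
    tgt = feat_tgt / np.linalg.norm(feat_tgt, axis=1, keepdims=True)
    sim = tgt @ ref.T                                          # (M, N) cosine similarity
    # Indices of the top_k most similar reference locations per target location.
    idx = np.argpartition(-sim, top_k - 1, axis=1)[:, :top_k]  # (M, k)
    top_sim = np.take_along_axis(sim, idx, axis=1) / temperature
    top_sim -= top_sim.max(axis=1, keepdims=True)              # numerical stability
    w = np.exp(top_sim)
    w /= w.sum(axis=1, keepdims=True)                          # softmax over the k neighbours
    # Weighted average of the neighbours' labels.
    return np.einsum("mk,mkl->ml", w, labels_ref[idx])

# Hypothetical usage: 10 locations, 6 channels, 3 label classes.
rng = np.random.default_rng(0)
feats = rng.normal(size=(10, 6))
labels = np.eye(3)[rng.integers(0, 3, size=10)]
soft = propagate_labels(feats, labels, feats + 0.01 * rng.normal(size=(10, 6)))
pred = soft.argmax(axis=1)   # hard label per target location
```

Because the propagation step itself has no trainable parameters, results on this task mainly reflect how well the learned features capture temporal correspondence.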
Results
The method outperforms existing methods in single object discovery without post-processing.
It significantly improves multiple object discovery performance compared to unsupervised baselines.
It obtains state-of-the-art results on label propagation tasks across different datasets.
Further Discussion
Ablation studies examine feature usage, learning objectives, and frame sampling. The limitations of the model are also discussed.
Conclusion
The proposed framework effectively leverages semantics and temporal correspondence for enhanced object-centric perception in videos.
Stats
Self-supervised methods have shown remarkable progress in learning high-level semantics and low-level temporal correspondence.
Our model effectively identifies multiple object instances with semantic structure, reaching promising results on unsupervised video object discovery.
We achieve state-of-the-art performance on dense label propagation tasks.
Quotes
"Most of the existing works only concentrate on one of these features."
"Our contributions are: (1) We propose a novel self-supervised architecture that unifies semantic discrimination and temporal correspondence."