Main Idea
The proposed hybrid visual correspondence (HVC) framework effectively combines static and dynamic cues to enable efficient and scalable self-supervised video object segmentation, outperforming state-of-the-art methods while requiring notably less training data and time.
Abstract
The paper presents a novel self-supervised approach called hybrid visual correspondence (HVC) for video object segmentation (VOS). Unlike conventional video-based methods, HVC learns visual representations solely from static images, eliminating the need for labeled video data.
Key highlights:
- HVC integrates static and dynamic visual correspondence learning to capture both spatial and temporal consistency from static images alone. This is achieved by:
  - Establishing static correspondence using coordinate information between cropped image views.
  - Introducing pseudo-dynamic signals between the cropped views to capture dynamic (temporal) consistency.
  - Proposing a hybrid visual correspondence loss to learn joint static and dynamic representations.
- HVC outperforms state-of-the-art self-supervised VOS methods on several benchmarks, including DAVIS16, DAVIS17, YouTube-VOS18, and VOST, while requiring significantly less training data and time.
- HVC also demonstrates competitive performance on additional video label propagation tasks, such as part segmentation and pose tracking.
- The authors provide extensive experiments and ablation studies to validate the effectiveness of the proposed approach.
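The crop-based mechanism in the highlights above can be sketched conceptually: two overlapping crops of one static image act as a pseudo frame pair, the known crop coordinates give static (spatial) correspondence for free, and feature affinity between the crops plays the role of a dynamic cue. Everything below (feature dimensions, the noise standing in for augmentation, the loss forms, the 0.5 weight) is an illustrative assumption, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in per-pixel embeddings for one static image (H x W x C). In HVC
# these would come from a learned encoder; random features are used here
# only to keep the sketch self-contained.
H, W, C, S = 20, 20, 16, 8
feats = rng.standard_normal((H, W, C))

def take_view(y, x, size=S, noise=0.05):
    """Crop a square view and keep its top-left image coordinate; the small
    noise stands in for photometric augmentation between the two views."""
    view = feats[y:y + size, x:x + size] \
        + noise * rng.standard_normal((size, size, C))
    return view, (y, x)

# Two overlapping crops of the same image form a pseudo frame pair.
view_a, pos_a = take_view(2, 3)
view_b, pos_b = take_view(5, 6)

def overlap(p1, p2, size=S):
    """Local slices of the region both crops share, derived purely from the
    known crop coordinates (the static cue)."""
    y0, y1 = max(p1[0], p2[0]), min(p1[0] + size, p2[0] + size)
    x0, x1 = max(p1[1], p2[1]), min(p1[1] + size, p2[1] + size)
    local = lambda p: (slice(y0 - p[0], y1 - p[0]),
                       slice(x0 - p[1], x1 - p[1]))
    return local(p1), local(p2)

sa, sb = overlap(pos_a, pos_b)

# Static correspondence: pixels at identical image coordinates should embed
# identically, so penalise their feature distance over the overlap.
static_loss = float(np.mean((view_a[sa] - view_b[sb]) ** 2))

# Pseudo-dynamic signal: treating the crop pair as consecutive "frames", a
# cosine affinity over all pixel features should route each view_a pixel to
# its shifted counterpart in view_b, as in label propagation.
fa = view_a.reshape(-1, C)
fb = view_b.reshape(-1, C)
fa = fa / np.linalg.norm(fa, axis=1, keepdims=True)
fb = fb / np.linalg.norm(fb, axis=1, keepdims=True)
affinity = fa @ fb.T                # (S*S, S*S) similarity matrix
match = affinity.argmax(axis=1)     # best view_b pixel per view_a pixel

# One known correspondence: view_a local (4, 4) is image pixel (6, 7),
# which sits at view_b local (1, 1).
dynamic_loss = float(1.0 - affinity[4 * S + 4, 1 * S + 1])

# Hybrid objective (illustrative): weighted sum of both terms; the 0.5
# weight is a hypothetical choice, not a value from the paper.
hybrid_loss = static_loss + 0.5 * dynamic_loss
```

With random stand-in features, the argmax of the affinity already recovers the coordinate shift on the overlap, which is the intuition behind using crop geometry as free supervision; in the actual method the encoder is what gets trained by such an objective.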
Statistics
HVC trained with 95K images achieves performance comparable to LIIR [4] trained with 470K images.
HVC requires only 16GB GPU memory and 2 hours of training time, significantly less than existing self-supervised methods.
Quotes
"HVC extracts pseudo-dynamic signals from static images, enabling an efficient and scalable VOS model."
"Our approach, without bells and whistles, necessitates only one training session using static image data, significantly reducing memory consumption (∼16GB) and training time (∼2h)."