Core Concepts
OneVOS is a unified framework for Video Object Segmentation (VOS) that handles the core components of the task inside a single All-in-One Transformer and reports state-of-the-art performance.
Abstract
OneVOS approaches Video Object Segmentation by unifying its core components in a single Vision Transformer. Feature extraction, matching, memory management, and object aggregation are all carried out by one All-in-One Transformer rather than by separate modules. A Unidirectional Hybrid Attention mechanism governs how tokens attend to one another inside this transformer, and a Dynamic Token Selector improves efficiency by adaptively choosing which tokens to keep. Extensive experiments across several datasets, including LVOS and MOSE, show that OneVOS outperforms previous methods and remains effective in complex scenarios.
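The summary does not spell out how Unidirectional Hybrid Attention works, so the following is only a minimal sketch of one plausible reading. It assumes that stored memory tokens and current-frame query tokens are concatenated into one sequence, and that attention is blocked in one direction: query tokens may attend to everything, while memory tokens attend only to other memory tokens. The function names `unidirectional_mask` and `masked_attention` are hypothetical and are not taken from the OneVOS code.

```python
import numpy as np


def unidirectional_mask(num_memory: int, num_query: int) -> np.ndarray:
    """Boolean mask where mask[i, j] is True if token i may attend to token j.

    Assumed layout: the first `num_memory` tokens are memory tokens,
    the remaining `num_query` tokens belong to the current frame.
    """
    n = num_memory + num_query
    mask = np.ones((n, n), dtype=bool)
    # Assumption: memory tokens must not attend to query tokens,
    # so stored features are not altered by the current frame.
    mask[:num_memory, num_memory:] = False
    return mask


def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention with a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)  # blocked pairs get zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(5, 4))  # 2 memory tokens + 3 query tokens
    mask = unidirectional_mask(num_memory=2, num_query=3)
    out, weights = masked_attention(tokens, tokens, tokens, mask)
    print(mask.astype(int))
    print(out.shape)
```

In this sketch a single transformer pass both refines the query features and matches them against memory, which is the kind of unification the abstract describes. The actual OneVOS mechanism may decouple attention differently.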
Stats
Achieved a J&F score of 70.1% on the LVOS dataset.
Surpassed previous methods by 4.2% and 7.0% on the LVOS and MOSE datasets, respectively.
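For readers unfamiliar with the metric, J&F is conventionally the mean of region similarity J (intersection over union between the predicted and ground-truth masks) and contour accuracy F (a boundary F-measure). The sketch below computes J for a pair of binary masks and averages it with an F value supplied by the caller. The helper names are hypothetical, the boundary F-measure itself is not implemented, and returning 1.0 when both masks are empty is an assumed convention.

```python
import numpy as np


def region_similarity(pred, gt) -> float:
    """J: intersection over union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


def j_and_f(j: float, f: float) -> float:
    """J&F: arithmetic mean of region similarity and contour accuracy."""
    return (j + f) / 2.0


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    gt = [[1, 0], [0, 0]]
    j = region_similarity(pred, gt)  # 1 shared pixel out of 2 in the union
    print(j, j_and_f(j, 0.7))        # 0.7 is a made-up F value for illustration
```

Benchmark scores such as the ones listed above are typically averages of these per-object, per-frame values over a whole dataset.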