The authors introduce LVOS, a large-scale benchmark dataset for long-term video object segmentation (VOS). LVOS contains 720 videos with an average duration of 1.14 minutes, far longer than the videos in existing VOS datasets (typically 3-10 seconds). The dataset is densely and accurately annotated, with a total of 296,401 frames and 407,945 annotations.
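As a quick sanity check on these figures, the short Python sketch below derives a few per-video averages from the numbers quoted above. It assumes the frame and annotation totals cover all 720 videos and that the 1.14-minute average applies to the same set. The function name `dataset_stats` and the derived quantities are illustrative, not taken from the paper, and the "implied" rate is only the rate of counted frames, not necessarily the native frame rate of the source videos.

```python
# Figures quoted in the summary above.
NUM_VIDEOS = 720
TOTAL_FRAMES = 296_401
TOTAL_ANNOTATIONS = 407_945
AVG_DURATION_MIN = 1.14


def dataset_stats(num_videos, total_frames, total_annotations, avg_duration_min):
    """Derive simple per-video averages from the aggregate counts (illustrative helper)."""
    avg_duration_s = avg_duration_min * 60
    frames_per_video = total_frames / num_videos
    return {
        "avg_duration_s": avg_duration_s,
        "frames_per_video": frames_per_video,
        # Counted frames per second of video, under the assumptions stated above.
        "implied_fps": frames_per_video / avg_duration_s,
        "annotations_per_frame": total_annotations / total_frames,
    }


stats = dataset_stats(NUM_VIDEOS, TOTAL_FRAMES, TOTAL_ANNOTATIONS, AVG_DURATION_MIN)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the totals work out to roughly 412 counted frames per video, about 6 counted frames per second, and a little under 1.4 annotations per frame.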
LVOS is designed to better reflect real-world scenarios: its videos exhibit challenges such as long-term reappearance of objects, cross-temporal confusion, and small objects. The authors evaluate 20 existing VOS models on LVOS under four settings: semi-supervised, unsupervised single-object, unsupervised multi-object, and interactive VOS.
The results show a significant performance drop for these models on LVOS compared with their performance on short-term video datasets. Through attribute-based analysis and visualization of prediction results, the authors identify the primary factors behind the accuracy decline: increased video length, complex motion, large scale variations, frequent disappearances, and confusion with similar-looking backgrounds.
The authors also explore potential avenues for improving the performance of VOS models on long-term videos, such as retraining the models on the diverse scenes in LVOS and addressing the issue of error accumulation over time. The LVOS benchmark and the comprehensive analysis provided in this work aim to advance the development of robust VOS models capable of handling real-world scenarios.
arxiv.org