In Video Instance Segmentation (VIS), training on large-scale datasets is crucial for improving performance, yet annotated VIS datasets remain scarce because labeling is labor-intensive. To address this challenge, the paper proposes Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation (TMT-VIS), which leverages extra taxonomy information to help the model focus on the taxonomy of each dataset, improving both classification and mask precision. The approach adds a two-stage module consisting of a Taxonomy Compilation Module (TCM) and a Taxonomy Injection Module (TIM), and it shows significant improvements over baseline solutions on popular benchmarks including YouTube-VIS 2019, YouTube-VIS 2021, OVIS, and UVO, setting new state-of-the-art records by effectively training on and exploiting multiple datasets.
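The summary above only names the two modules. As a rough illustration of the idea, here is a minimal PyTorch sketch of how such a two-stage taxonomy module could be wired into a query-based VIS decoder: taxonomy (class-name) embeddings are first compiled against the video features, then injected into the instance queries via cross-attention. All class names, shapes, and the exact attention wiring here are assumptions for illustration, not the authors' implementation; consult the arXiv PDF linked below for the actual design.

```python
# Hypothetical sketch of the TCM/TIM idea; names and wiring are assumptions.
import torch
import torch.nn as nn


class TaxonomyCompilationModule(nn.Module):
    """Compile dataset-specific taxonomy embeddings into tokens that
    emphasize the taxonomies actually present in the current video."""

    def __init__(self, embed_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, visual_feats: torch.Tensor, taxonomy_embeds: torch.Tensor):
        # visual_feats:    (B, N, D) per-frame / per-instance visual features
        # taxonomy_embeds: (B, C, D) embeddings of the dataset's class names
        compiled, _ = self.attn(taxonomy_embeds, visual_feats, visual_feats)
        return compiled  # (B, C, D)


class TaxonomyInjectionModule(nn.Module):
    """Inject compiled taxonomy tokens into the instance queries,
    biasing a Mask2Former-style decoder toward the dataset's taxonomy."""

    def __init__(self, embed_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, queries: torch.Tensor, compiled_taxonomy: torch.Tensor):
        # queries: (B, Q, D) learnable instance queries
        injected, _ = self.attn(queries, compiled_taxonomy, compiled_taxonomy)
        return self.norm(queries + injected)  # residual + norm


if __name__ == "__main__":
    B, N, C, Q, D = 2, 100, 40, 10, 256
    tcm = TaxonomyCompilationModule(D)
    tim = TaxonomyInjectionModule(D)
    feats = torch.randn(B, N, D)
    taxonomy = torch.randn(B, C, D)  # e.g. text embeddings of class names
    queries = torch.randn(B, Q, D)
    out = tim(queries, tcm(feats, taxonomy))
    print(out.shape)  # torch.Size([2, 10, 256])
```

Under these assumptions, the compilation step filters each dataset's full label set down to video-relevant taxonomy tokens, and the injection step conditions the shared instance queries on them, which is one plausible way a single model could be jointly trained across datasets with disjoint taxonomies.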
Key insights distilled from the paper by Rongkun Zhen... at arxiv.org, 03-19-2024: https://arxiv.org/pdf/2312.06630.pdf