Core Concept
Providing taxonomy information enhances video instance segmentation performance across multiple datasets.
Summary
In Video Instance Segmentation (VIS), training on large-scale datasets is crucial for improving performance. However, annotated VIS datasets are limited because annotation is labor-intensive. To address this challenge, a new approach named Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation (TMT-VIS) is proposed. It leverages extra taxonomy information to help the model focus on the taxonomy of each dataset, enhancing both classification and mask precision. By incorporating a two-stage module consisting of a Taxonomy Compilation Module (TCM) and a Taxonomy Injection Module (TIM), TMT-VIS shows significant improvements over baseline solutions on popular benchmarks such as YouTube-VIS 2019, YouTube-VIS 2021, OVIS, and UVO. The approach sets new state-of-the-art records by effectively training on and utilizing multiple datasets.
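To make the two-stage idea concrete, here is a minimal sketch of how taxonomy compilation and injection could be wired into a query-based VIS model. This is an illustrative assumption, not the authors' reference implementation: the module names mirror TCM and TIM, but the shapes, the learnable-slot design, and the use of cross-attention are assumptions for exposition.

```python
# Illustrative sketch only: shapes, slot design, and cross-attention layout are
# assumptions, not the TMT-VIS reference implementation.
import torch
import torch.nn as nn

class TaxonomyCompilationModule(nn.Module):
    """Compile per-dataset taxonomy (class-name) embeddings into a compact set."""
    def __init__(self, dim: int = 256, num_compiled: int = 32, num_heads: int = 8):
        super().__init__()
        self.compiled = nn.Parameter(torch.randn(num_compiled, dim))  # learnable slots
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, taxonomy_embed: torch.Tensor) -> torch.Tensor:
        # taxonomy_embed: (B, num_classes_in_dataset, dim), e.g. text embeddings of class names
        slots = self.compiled.unsqueeze(0).expand(taxonomy_embed.size(0), -1, -1)
        compiled, _ = self.attn(slots, taxonomy_embed, taxonomy_embed)
        return compiled  # (B, num_compiled, dim)

class TaxonomyInjectionModule(nn.Module):
    """Inject compiled taxonomy information into the instance queries."""
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries: torch.Tensor, compiled: torch.Tensor) -> torch.Tensor:
        # queries: (B, num_queries, dim) instance queries of a Mask2Former-style decoder
        injected, _ = self.attn(queries, compiled, compiled)
        return self.norm(queries + injected)  # residual update of the queries

# Usage: for each training batch, take the taxonomy of the source dataset and
# condition the instance queries on it before mask/class prediction.
tcm, tim = TaxonomyCompilationModule(), TaxonomyInjectionModule()
taxonomy = torch.randn(2, 40, 256)   # e.g. 40 class-name embeddings (YouTube-VIS)
queries = torch.randn(2, 100, 256)   # 100 instance queries
queries = tim(queries, tcm(taxonomy))
print(queries.shape)  # torch.Size([2, 100, 256])
```

The design intent this sketch tries to capture is that each dataset in the joint-training mix contributes its own taxonomy, so the queries are steered toward the categories actually present in the current sample's source dataset.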
Statistics
"Our model shows significant improvement over the baseline solutions."
"Compared with Mask2Former-VIS [1] with the ResNet-50 backbone, our TMT-VIS gets absolute AP improvements of 3.3%, 4.3%, 5.8%, and 3.5% on the aforementioned challenging benchmarks, respectively."
"Compared with another high-performance solution VITA [3], our solution gets absolute AP improvements of 2.8%, 2.6%, 5.5%, and 3.1%, respectively."
Quotes
"Our main contributions can be summarized threefold: We analyze the limitations of existing video instance segmentation methods and propose a novel multiple-dataset training algorithm named TMT-VIS."
"We develop a two-stage module: Taxonomy Compilation Module (TCM) and Taxonomy Injection Module (TIM)."
"Our proposed TMT-VIS harvests great performance improvements over the baselines and sets new state-of-the-art records on multiple popular and challenging VIS datasets and benchmarks."