In Video Instance Segmentation (VIS), training on large-scale datasets is crucial for performance, but annotated VIS datasets remain scarce because labeling them is labor-intensive. To address this, the paper proposes Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation (TMT-VIS), which leverages extra taxonomy information so that a model trained jointly on multiple datasets can focus on the taxonomies relevant to each input, improving both classification and mask precision. The approach incorporates a two-stage module: a Taxonomy Compilation Module (TCM) that compiles the taxonomy information, followed by a Taxonomy Injection Module (TIM) that injects it into the model. TMT-VIS delivers significant improvements over baseline solutions and sets new state-of-the-art records on the popular YouTube-VIS 2019, YouTube-VIS 2021, OVIS, and UVO benchmarks, demonstrating the value of effectively training on and utilizing multiple datasets.
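To make the two-stage design concrete, below is a minimal PyTorch sketch of a compilation-then-injection pipeline. The module names (TCM, TIM) follow the paper, but the internals shown here (cross-attention, the embedding size, the residual update) are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of the TCM -> TIM pipeline described above.
# Internals (cross-attention injection, sizes) are assumptions.
import torch
import torch.nn as nn

class TaxonomyCompilationModule(nn.Module):
    """Compile per-dataset taxonomy (class-name) embeddings into a
    compact taxonomy context relevant to the current video."""
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)

    def forward(self, video_feats, taxonomy_embeds):
        # video_feats:     (B, N, C) pooled frame/query features
        # taxonomy_embeds: (B, K, C) embeddings of the dataset's class names
        # Attend from video features to taxonomy entries so only the
        # taxonomies relevant to this clip are retained.
        compiled, _ = self.attn(video_feats, taxonomy_embeds, taxonomy_embeds)
        return compiled

class TaxonomyInjectionModule(nn.Module):
    """Inject the compiled taxonomy context into instance queries."""
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, queries, compiled_taxonomy):
        # queries: (B, Q, C) instance queries of the VIS decoder
        injected, _ = self.attn(queries, compiled_taxonomy, compiled_taxonomy)
        return self.norm(queries + injected)  # residual update of the queries

# Usage: refine decoder queries with dataset-specific taxonomy context.
B, N, Q, K, C = 2, 100, 100, 40, 256
tcm, tim = TaxonomyCompilationModule(C), TaxonomyInjectionModule(C)
queries = tim(torch.randn(B, Q, C), tcm(torch.randn(B, N, C), torch.randn(B, K, C)))
```

The key design idea this sketch tries to capture is that taxonomy information is first filtered against the video content (compilation) and only then used to condition the instance queries (injection), rather than being fed to the decoder wholesale.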
Key Insights Extracted From
by Rongkun Zhen... at arxiv.org, 03-19-2024
https://arxiv.org/pdf/2312.06630.pdf