Core Concepts
VimTS is a unified framework that leverages the synergy between different text spotting tasks and scenarios to enhance the model's generalization ability.
Abstract
The paper introduces VimTS, a unified framework for text spotting that aims to enhance the generalization ability of the model by achieving better synergy among different tasks.
Key highlights:
- Proposes a Prompt Queries Generation Module (PQGM) and a Tasks-aware Adapter that convert the original single-task model into a multi-task model suitable for both image and video scenarios with minimal additional parameters (a rough sketch of these two components follows this list).
- Introduces VTD-368k, a large-scale synthetic video text dataset, derived from high-quality open-source videos, designed to improve text spotting models.
- In image-level cross-domain text spotting, VimTS demonstrates superior performance compared to state-of-the-art methods across six benchmarks, with an average improvement of 2.6% in Hmean.
- In video-level cross-domain text spotting, VimTS surpasses the previous end-to-end approach on both the ICDAR2015 video and DSText v2 benchmarks by an average of 5.5% in MOTA (both metrics are defined after this list), using only image-level data.
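The paper's exact implementation is not reproduced here; the following is a minimal PyTorch sketch of the two ideas named in the first highlight: a bottleneck adapter that adds only a few trainable parameters to a frozen spotter layer, and per-task prompt queries prepended to the decoder queries. All class names, dimensions, and the prompt-concatenation scheme are assumptions for illustration, not VimTS's actual code.

```python
import torch
import torch.nn as nn

class TaskAwareAdapter(nn.Module):
    """Bottleneck adapter: a small residual MLP added to a frozen layer so
    that adapting to a new task trains only a few extra parameters."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))

class PromptQueries(nn.Module):
    """Hypothetical stand-in for the PQGM: one learnable set of query
    embeddings per task (e.g. detection, recognition, tracking), prepended
    to the decoder queries to switch the model between tasks."""
    def __init__(self, num_tasks: int, num_prompts: int, dim: int):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_tasks, num_prompts, dim) * 0.02)

    def forward(self, task_id: int, queries: torch.Tensor) -> torch.Tensor:
        # queries: (batch, n, dim); broadcast the chosen task prompt over the batch
        prompt = self.prompts[task_id].unsqueeze(0).expand(queries.size(0), -1, -1)
        return torch.cat([prompt, queries], dim=1)
```

Under this reading, the backbone and decoder weights stay frozen while only the adapters and prompts receive gradients, which is what keeps the parameter overhead of the single-task-to-multi-task conversion small.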
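For context, the two metrics cited above are the standard definitions: Hmean is the harmonic mean of precision P and recall R, and MOTA accumulates false negatives, false positives, and identity switches over all frames t relative to the number of ground-truth objects:

```latex
\mathrm{Hmean} = \frac{2PR}{P + R}
\qquad
\mathrm{MOTA} = 1 - \frac{\sum_t \left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t\right)}{\sum_t \mathrm{GT}_t}
```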
Stats
"Text spotting faces challenges in cross-domain adaption, such as image-to-image and image-to-video generalization."
"Existing Large Multimodal Models exhibit limitations in generating cross-domain scene text spotting, in contrast to our VimTS model which requires significantly fewer parameters and data."
"The synthetic video data is more consistent with the text between frames by utilizing the CoDeF to facilitate the achievement of realistic and stable text flow propagation."
Quotes
"Text spotting has shown promising progress, addressing cross-domain text spotting remains a significant challenge that requires further exploration."
"Videos incorporating dynamic factors like occlusions, swift changes in scenes, and temporal relationships between frames, the involvement of additional video text tracking tasks diminishes the effectiveness and accuracy of text spotters designed for still images."
"By utilizing the CoDeF, our method facilitates the achievement of realistic and stable text flow propagation, significantly reducing the occurrence of distortions."