The paper addresses the problem of referring video object segmentation (RVOS), which aims to segment the target instance referred to by a given text expression in a video clip. The authors observe that existing RVOS models tend to over-attend to the action- and relation-related visual attributes of the instance, leading to partial or incorrect mask predictions.
To tackle this issue, the authors propose the LoSh framework, which takes both long and short text expressions as input. The short text expression retains only the appearance-related information of the target instance, encouraging the model to focus on the instance's appearance. LoSh introduces a long-short cross-attention module that strengthens the features derived from the long text expression using those from the short one. Additionally, a long-short predictions intersection loss is introduced to regularize the model's predictions for the long and short text expressions against each other.
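To make the idea concrete, one plausible form of such an intersection constraint (an illustrative sketch, not necessarily the paper's exact formulation; $\hat{M}_l$ and $\hat{M}_s$ denote the predicted masks for the long and short expressions, and $p$ indexes pixels) is

$$\mathcal{L}_{\text{inter}} = 1 - \frac{\sum_{p} \hat{M}_l(p)\,\hat{M}_s(p)}{\sum_{p} \hat{M}_l(p) + \hat{M}_s(p) - \hat{M}_l(p)\,\hat{M}_s(p)},$$

i.e. a soft IoU between the two predictions, which is minimized when the masks predicted from the long and short expressions agree on the same instance.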
Beyond the linguistic aspect, the authors also aim to improve the visual side of the model by exploiting temporal consistency of visual features. A forward-backward visual consistency loss is introduced: optical flows are computed between video frames and used to warp the features of each annotated frame's temporal neighbors back to the annotated frame, and consistency between the warped and original features is then optimized.
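As an illustrative sketch (the notation here is an assumption, not the paper's): let $F_t$ be the features of annotated frame $t$, $F_{t\pm k}$ the features of its temporal neighbors, and $\mathcal{W}(\cdot,\, o_{t\pm k \to t})$ the warping operator induced by the estimated optical flow from frame $t\pm k$ to frame $t$. A consistency term of the form

$$\mathcal{L}_{\text{cons}} = \sum_{k} \big\| F_t - \mathcal{W}(F_{t+k},\, o_{t+k \to t}) \big\|_1 + \big\| F_t - \mathcal{W}(F_{t-k},\, o_{t-k \to t}) \big\|_1$$

penalizes neighboring-frame features that, once warped to the annotated frame, disagree with that frame's features, covering neighbors both forward and backward in time.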
The proposed LoSh is built upon two state-of-the-art RVOS pipelines, MTTR and SgMg. Extensive experiments on A2D-Sentences, Refer-YouTube-VOS, JHMDB-Sentences and Refer-DAVIS17 datasets show that LoSh significantly outperforms the baselines across all metrics.