Enhancing Referring Video Object Segmentation by Leveraging Long and Short Text Expressions
This work aims to enhance referring video object segmentation (RVOS) by jointly leveraging long and short text expressions of the target object. The authors propose LoSh, a long-short text joint prediction network that uses a long-short cross-attention module and a long-short predictions intersection loss to better align linguistic and visual information. A forward-backward visual consistency loss is also introduced to exploit temporal consistency in the visual features.
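The summary does not spell out the form of the long-short predictions intersection loss. As a rough illustration only, a soft-IoU-style penalty that encourages the masks predicted from the long and short expressions to overlap might look like the sketch below; the function name and exact formulation are assumptions, not the authors' definition.

```python
import numpy as np

def intersection_loss(long_mask: np.ndarray, short_mask: np.ndarray,
                      eps: float = 1e-6) -> float:
    """Hypothetical intersection loss between two soft masks in [0, 1].

    Returns 1 minus a soft IoU: 0 when the long- and short-expression
    predictions agree perfectly, approaching 1 as they become disjoint.
    """
    inter = np.sum(long_mask * short_mask)
    union = np.sum(long_mask) + np.sum(short_mask) - inter
    return float(1.0 - inter / (union + eps))

# Identical predictions give near-zero loss; disjoint ones give loss near 1.
full = np.ones((4, 4))
left = np.zeros((4, 4)); left[:, :2] = 1.0
right = np.zeros((4, 4)); right[:, 2:] = 1.0
```

In a real training setup this term would be computed on the network's per-frame mask logits (after a sigmoid) and added to the segmentation objective with a weighting coefficient.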