A novel multi-scale memory matching and decoding scheme with clip-based time-coded memory and a transformation-aware loss, designed to effectively track objects undergoing complex transformations, small objects, and objects in long videos.
The proposed spatial-temporal multi-level association (STMA) framework efficiently learns dynamic, target-aware features by jointly associating reference-frame, test-frame, and object features, and uses a spatial-temporal memory bank to assist feature association and temporal ID assignment for accurate video object segmentation.
A novel end-to-end framework that leverages the unique properties of event cameras to enhance video object segmentation accuracy under low-light conditions.