Key Concepts
This paper proposes an efficient technique called Coarse-to-Fine Feature Mining (CFFM) that jointly learns local temporal contexts, covering both static and motional contexts, for video semantic segmentation. It also introduces an extension, CFFM++, that further exploits global temporal contexts from the whole video to improve segmentation performance.
Summary
The paper focuses on video semantic segmentation (VSS) and proposes two key techniques to effectively leverage temporal contexts:
Local Temporal Contexts:
CFFM learns a unified representation of static and motional contexts from neighboring video frames.
It uses a Coarse-to-Fine Feature Assembling (CFFA) module to organize features from nearby frames in a multi-scale manner, capturing both static and motional contexts.
The Cross-frame Feature Mining (CFM) module then mines useful contextual information from neighboring frames to enhance the target frame features.
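The CFM step described above can be viewed as cross-attention in which target-frame tokens query the multi-scale tokens assembled from neighboring frames. The following is a minimal numpy sketch under that interpretation; the function name and the residual formulation are assumptions for illustration, not the paper's exact implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_mining(target, context):
    """Hypothetical sketch of CFM as cross-attention.

    target:  (N_t, d) feature tokens from the target frame (queries)
    context: (N_c, d) tokens assembled from neighboring frames by CFFA
    """
    d = target.shape[-1]
    # Each target token attends to all cross-frame context tokens.
    attn = softmax(target @ context.T / np.sqrt(d))  # (N_t, N_c)
    mined = attn @ context                           # mined contextual features
    # Residual enhancement of the target-frame representation (assumed).
    return target + mined

# Toy usage with random features.
rng = np.random.default_rng(0)
target = rng.normal(size=(4, 16))
context = rng.normal(size=(10, 16))
enhanced = cross_frame_mining(target, context)
```

The output keeps the target frame's token layout, so the enhanced features can feed an ordinary per-frame segmentation head unchanged.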
Global Temporal Contexts:
CFFM++ extends CFFM by additionally exploiting global temporal contexts from the whole video.
It uniformly samples frames from the video, extracts global contextual prototypes using k-means clustering, and then uses CFM to refine the target frame features with the global contexts.
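The prototype-extraction step above can be sketched with plain k-means over features pooled from the uniformly sampled frames; the resulting cluster centers would then play the role of global contextual prototypes fed to CFM. The function below is an illustrative assumption (initialization, iteration count, and pooling are not taken from the paper):

```python
import numpy as np

def global_prototypes(frame_feats, k=8, iters=10, seed=0):
    """Hypothetical sketch: k-means centers as global contextual prototypes.

    frame_feats: (N, d) feature vectors pooled from frames sampled
                 uniformly across the whole video.
    Returns (k, d) cluster centers.
    """
    rng = np.random.default_rng(seed)
    # Initialize centers from randomly chosen feature vectors (assumed).
    centers = frame_feats[rng.choice(len(frame_feats), k, replace=False)]
    for _ in range(iters):
        # Assign each feature to its nearest center.
        dists = ((frame_feats[:, None] - centers[None]) ** 2).sum(-1)
        assign = np.argmin(dists, axis=1)
        # Update each center as the mean of its assigned features.
        for j in range(k):
            members = frame_feats[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers
```

In CFFM++ these prototypes would serve as the context tokens when refining the target-frame features, so the cost of attending to the whole video stays fixed at k tokens regardless of video length.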
The proposed CFFM and CFFM++ techniques outperform state-of-the-art methods on popular VSS benchmarks, demonstrating the effectiveness of jointly learning local and global temporal contexts.
Statistics
The mIoU between ground-truth masks of consecutive video frames on the VSPW validation set is 89.7%, indicating high semantic consistency across neighboring frames.
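The consistency statistic above is a standard per-class mean IoU computed between the ground-truth masks of adjacent frames. A minimal sketch of that computation (ignoring any class or ignore-label conventions specific to VSPW):

```python
import numpy as np

def mean_iou(mask_a, mask_b, num_classes):
    """Mean IoU between two integer label masks of the same shape."""
    ious = []
    for c in range(num_classes):
        a, b = (mask_a == c), (mask_b == c)
        union = (a | b).sum()
        if union:  # skip classes absent from both masks
            ious.append((a & b).sum() / union)
    return float(np.mean(ious))

# Identical masks yield a mean IoU of 1.0.
m = np.array([[0, 0], [1, 1]])
print(mean_iou(m, m, num_classes=2))  # → 1.0
```

A high value between consecutive ground-truth masks (89.7% on the VSPW validation set) is what motivates borrowing features from neighboring frames in the first place.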