This paper introduces PerVFI, a novel perception-oriented video frame interpolation paradigm that tackles blur and ghosting artifacts by incorporating an asymmetric synergistic blending module and a conditional normalizing-flow-based generator.
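As a rough illustration, the sketch below shows one conditional affine coupling step, the standard building block of a normalizing-flow generator like the one described above. The module name `ConditionalAffineCoupling`, the layer widths, and the use of blended features as the conditioning input are assumptions for illustration, not PerVFI's actual implementation.

```python
# A minimal sketch of a conditional affine coupling layer (assumed design, not PerVFI's code).
import torch
import torch.nn as nn

class ConditionalAffineCoupling(nn.Module):
    """Affine coupling layer whose scale/shift are predicted from the untouched
    half of the input plus an external conditioning feature (e.g. blended
    features from the two input frames)."""
    def __init__(self, channels: int, cond_channels: int, hidden: int = 64):
        super().__init__()
        self.half = channels // 2
        self.net = nn.Sequential(
            nn.Conv2d(self.half + cond_channels, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 2 * (channels - self.half), 3, padding=1),
        )

    def forward(self, x, cond):
        # Split channels; transform the second half conditioned on the first
        # half and the external condition.
        x_a, x_b = x[:, :self.half], x[:, self.half:]
        params = self.net(torch.cat([x_a, cond], dim=1))
        log_scale, shift = params.chunk(2, dim=1)
        log_scale = torch.tanh(log_scale)          # keep the transform stable
        y_b = x_b * torch.exp(log_scale) + shift   # invertible affine map
        log_det = log_scale.flatten(1).sum(dim=1)  # log-determinant per sample
        return torch.cat([x_a, y_b], dim=1), log_det

    def inverse(self, y, cond):
        y_a, y_b = y[:, :self.half], y[:, self.half:]
        params = self.net(torch.cat([y_a, cond], dim=1))
        log_scale, shift = params.chunk(2, dim=1)
        log_scale = torch.tanh(log_scale)
        x_b = (y_b - shift) * torch.exp(-log_scale)
        return torch.cat([y_a, x_b], dim=1)
```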
The proposed ColorMNet effectively explores spatial-temporal features for video colorization by: 1) using a large pretrained visual model to guide the estimation of robust spatial features for each frame, 2) developing a memory-based feature propagation module that adaptively propagates useful features from far-apart frames, and 3) exploiting the similar content of adjacent frames through a local attention module.
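The following is a minimal sketch of the memory-based propagation idea: keys and values of past frames are stored in a bank, and the current frame reads them with attention. The `FrameMemory` class, its capacity, and the key/value layout are illustrative assumptions, not ColorMNet's code.

```python
# A toy key/value memory read for propagating features from far-apart frames.
import torch
import torch.nn.functional as F

class FrameMemory:
    def __init__(self, max_frames: int = 32):
        self.keys, self.values = [], []
        self.max_frames = max_frames

    def write(self, key: torch.Tensor, value: torch.Tensor):
        # key: (C, H*W), value: (Cv, H*W) descriptors of one past frame
        self.keys.append(key)
        self.values.append(value)
        if len(self.keys) > self.max_frames:   # drop the oldest frame
            self.keys.pop(0)
            self.values.pop(0)

    def read(self, query: torch.Tensor) -> torch.Tensor:
        # query: (C, N) features of the current frame
        k = torch.cat(self.keys, dim=1)        # (C, M) memory keys
        v = torch.cat(self.values, dim=1)      # (Cv, M) memory values
        attn = F.softmax(k.t() @ query / k.shape[0] ** 0.5, dim=0)  # (M, N)
        return v @ attn                        # (Cv, N) propagated features
```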
The proposed Space-Time Neural Operator (STNO) effectively extracts fine-grained spatiotemporal representations from coarse-grained intra-frame features by modeling the task as a mapping between two continuous function spaces. The Galerkin-type attention mechanism in STNO enables precise and efficient motion estimation and compensation, particularly for large motions.
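For reference, Galerkin-type attention replaces the softmax of standard attention with normalization of the keys and values, making the cost linear in the number of spatio-temporal points. The sketch below follows that general recipe; the exact projections and normalization used in STNO may differ.

```python
# A minimal Galerkin-type (softmax-free, linear-complexity) attention sketch.
import torch
import torch.nn as nn

class GalerkinAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        # Layer norms on keys and values replace the softmax of standard attention.
        self.norm_k = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_points, dim), e.g. flattened spatio-temporal features
        q = self.to_q(x)
        k = self.norm_k(self.to_k(x))
        v = self.norm_v(self.to_v(x))
        n = x.shape[1]
        # (K^T V) is dim x dim, so the cost is linear in the number of points.
        return q @ (k.transpose(1, 2) @ v) / n
```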
TAC-SUM (Temporal-Aware Cluster-based SUMmarization) is a novel training-free approach that leverages temporal relations between video frames to generate concise and coherent video summaries.
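A minimal, training-free sketch of cluster-based keyframe selection in this spirit is shown below; the feature extractor is left abstract, and the temporal-index augmentation and cluster count are illustrative assumptions rather than TAC-SUM's actual procedure.

```python
# Toy cluster-based keyframe selection with temporal awareness (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

def summarize(frame_features: np.ndarray, n_keyframes: int = 5) -> list[int]:
    """frame_features: (num_frames, dim) per-frame embeddings in temporal order.
    Returns indices of selected keyframes, sorted by time."""
    # Append a normalized frame index so clusters stay temporally coherent.
    t = np.linspace(0.0, 1.0, len(frame_features))[:, None]
    feats = np.hstack([frame_features, t])
    labels = KMeans(n_clusters=n_keyframes, n_init=10).fit_predict(feats)
    keyframes = []
    for c in range(n_keyframes):
        members = np.flatnonzero(labels == c)
        centroid = feats[members].mean(axis=0)
        # Pick the member closest to its cluster centroid as the keyframe.
        keyframes.append(members[np.argmin(
            np.linalg.norm(feats[members] - centroid, axis=1))])
    return sorted(int(i) for i in keyframes)
```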
The proposed guided slot attention network leverages guided slots, a feature aggregation transformer, and K-nearest-neighbor filtering to effectively separate foreground and background spatial structural information, achieving state-of-the-art performance on challenging video object segmentation datasets.
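The core slot-attention update can be sketched as below. This is the generic iterative competition between slots and features (in the style of Locatello et al., 2020); the foreground/background "guided" initialization, the feature aggregation transformer, and the KNN filtering are paper-specific components that are left out here.

```python
# Simplified slot attention iteration; the 'guided' slot initialization is assumed external.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotAttention(nn.Module):
    def __init__(self, dim: int, n_iters: int = 3):
        super().__init__()
        self.n_iters = n_iters
        self.scale = dim ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, feats: torch.Tensor, slots: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, dim) pixel/patch features; slots: (B, S, dim),
        # e.g. initialized from foreground / background guidance.
        k, v = self.to_k(feats), self.to_v(feats)
        for _ in range(self.n_iters):
            q = self.to_q(slots)
            # Softmax over slots: slots compete for each feature location.
            attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=1)
            attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)  # weighted mean
            updates = attn @ v                                     # (B, S, dim)
            slots = self.gru(updates.reshape(-1, updates.shape[-1]),
                             slots.reshape(-1, slots.shape[-1])).view_as(slots)
        return slots
```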
The proposed FMA-Net framework effectively handles spatio-temporally-variant degradations in blurry low-resolution videos through flow-guided dynamic filtering and iterative feature refinement with multi-attention.
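The two ingredients named above can be illustrated separately: backward warping of features with an optical flow, followed by per-pixel dynamic filtering. The helper functions below are a simplified sketch under assumed shapes; in FMA-Net the flows and filter kernels are predicted jointly by the network rather than given as inputs.

```python
# Toy flow warping and per-pixel dynamic filtering (illustrative helpers).
import torch
import torch.nn.functional as F

def flow_warp(x: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp features x (B, C, H, W) with optical flow (B, 2, H, W)."""
    b, _, h, w = x.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=x.device),
                            torch.arange(w, device=x.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float() + flow   # absolute sample coords
    # Normalize coordinates to [-1, 1] for grid_sample.
    grid_x = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)         # (B, H, W, 2)
    return F.grid_sample(x, grid, align_corners=True)

def dynamic_filtering(x: torch.Tensor, kernels: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Apply a per-pixel k x k filter predicted elsewhere.
    x: (B, C, H, W); kernels: (B, k*k, H, W), softmax-normalized per pixel."""
    b, c, h, w = x.shape
    patches = F.unfold(x, k, padding=k // 2).view(b, c, k * k, h, w)
    return (patches * kernels.unsqueeze(1)).sum(dim=2)
```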
BlazeBVD introduces a novel approach to blind video deflickering, leveraging histogram-assisted solutions to enhance temporal consistency and eliminate flickering artifacts.
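As a simple illustration of a histogram-assisted strategy (not BlazeBVD's pipeline), one can match each frame's histogram to a slowly updated reference so that per-frame brightness jumps are suppressed; the `deflicker` helper and its moving-average coefficient below are assumptions for demonstration.

```python
# Toy histogram-matching deflicker against an exponential-moving-average reference.
import numpy as np
from skimage.exposure import match_histograms

def deflicker(frames: list[np.ndarray]) -> list[np.ndarray]:
    """frames: list of (H, W, 3) images. Purely illustrative."""
    out, ref = [], frames[0].astype(np.float64)
    for frame in frames:
        corrected = match_histograms(frame.astype(np.float64), ref, channel_axis=-1)
        # The EMA reference tracks slow illumination drift while
        # suppressing per-frame flicker.
        ref = 0.9 * ref + 0.1 * corrected
        out.append(corrected.astype(frame.dtype))
    return out
```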
Axial motion magnification improves the legibility of small motions by magnifying them along a user-specified axis.
A text-conditioned resampler enables efficient processing of long video sequences, improving performance across a variety of tasks.
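A resampler of this kind can be sketched as a small set of learned queries that cross-attend to the long video feature sequence, with the queries conditioned on the text. The class below, its pooled-text conditioning, and the query count are illustrative assumptions rather than the paper's architecture.

```python
# Toy text-conditioned resampler: fixed learned queries cross-attend to video features.
import torch
import torch.nn as nn

class TextConditionedResampler(nn.Module):
    def __init__(self, dim: int, n_queries: int = 64, n_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.text_proj = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, video_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (B, T*N, dim) frame/patch features of a long clip
        # text_feats:  (B, L, dim)   token embeddings of the query text
        b = video_feats.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Condition the queries on a pooled text embedding (one simple choice;
        # the actual conditioning scheme may differ).
        q = q + self.text_proj(text_feats.mean(dim=1, keepdim=True))
        out, _ = self.attn(q, video_feats, video_feats)
        return out  # (B, n_queries, dim) compact, text-aware video tokens
```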
Integrating Visual Language Models with Vision Transformers enhances video action understanding by aligning spatio-temporal representations.