Key Concepts
FLATTEN introduces flow-guided attention to improve visual consistency in text-to-video editing. It plugs into existing diffusion models and reports state-of-the-art performance.
Summary
The paper introduces FLATTEN, an approach that integrates optical flow into the attention module of diffusion models for text-to-video editing. FLATTEN enforces that patches on the same flow path across frames attend to each other, which improves visual consistency. Experimental results show that it maintains visual consistency better than existing methods.
Most recent works focus on extending advanced diffusion models from text-to-image to text-to-video editing by inflating spatial self-attention into spatio-temporal self-attention. However, these methods struggle with maintaining visual consistency due to irrelevant information introduced during the inflation process.
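To make the inflation step concrete, here is a minimal sketch in plain Python. It is an illustration under simplifying assumptions, not the paper's implementation: tokens are used directly as queries, keys, and values (no learned projections, single head), and the function names `attend`, `spatial_attention`, and `spatio_temporal_attention` are hypothetical. It shows that once all frames are flattened into one sequence, every patch attends to every patch in every frame, including unrelated ones.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    # Scaled dot-product attention over lists of equal-length vectors.
    # Assumption: no learned Q/K/V projections, single head.
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def spatial_attention(video):
    # video[frame][patch] is a feature vector; each frame attends only to itself.
    return [attend(frame, frame, frame) for frame in video]

def spatio_temporal_attention(video):
    # "Inflated" attention: flatten all frames into one token sequence,
    # so every patch attends to every patch of every frame.
    tokens = [tok for frame in video for tok in frame]
    out = attend(tokens, tokens, tokens)
    n = len(video[0])
    return [out[i * n:(i + 1) * n] for i in range(len(video))]
```

With a single patch per frame and two different frames, `spatial_attention` leaves each patch unchanged, while `spatio_temporal_attention` mixes in features from the other frame regardless of whether the two patches show the same content.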
FLATTEN addresses this challenge by leveraging optical flow to guide attention modules and ensure accurate communication of information across frames. By integrating FLATTEN into existing diffusion-based methods, high-quality and visually consistent text-to-video editing can be achieved without additional training or fine-tuning.
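The idea of restricting attention to a flow path can be sketched as follows. This is a simplified illustration, not the authors' code: the `trajectories` argument is a hypothetical, hand-supplied list of paths, whereas in the paper the paths are derived from optical flow. Tokens again serve directly as queries, keys, and values, with no learned projections.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    # Scaled dot-product attention (assumption: no learned projections).
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def flow_guided_attention(video, trajectories):
    # video[frame][patch] is a feature vector.
    # trajectories: hypothetical precomputed flow paths, each a list of
    # (frame_index, patch_index) pairs that follow one piece of content.
    out = [[list(tok) for tok in frame] for frame in video]
    for path in trajectories:
        # Gather only the patches that lie on this flow path.
        tokens = [video[f][p] for f, p in path]
        updated = attend(tokens, tokens, tokens)
        # Write the results back; patches on no path keep their features.
        for (f, p), vec in zip(path, updated):
            out[f][p] = vec
    return out
```

For example, with `trajectories = [[(0, 0), (1, 1)]]`, patch 0 of frame 0 exchanges information only with patch 1 of frame 1, the position the content is assumed to have moved to, and not with any other patch in the video.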
The paper includes an ablation study that evaluates the contributions of dense spatio-temporal attention (DSTA) and FLATTEN, individually and in combination. The results show that combining both modules significantly improves visual consistency in edited videos.
Additionally, user studies confirm that FLATTEN outperforms other methods in terms of semantic alignment, visual consistency, and motion preservation. The method is versatile and can be seamlessly integrated into various diffusion-based text-to-video editing frameworks for enhanced performance.
Statistics
"Experiment results on existing text-to-video editing benchmarks show that our proposed method achieves the new state-of-the-art performance."
"Our model achieves the new state-of-the-art performance on existing text-to-video editing benchmarks."
Quotes
"Our method excels in maintaining the visual consistency in the edited videos." - Research Paper