The paper introduces FLATTEN, a novel approach that integrates optical flow into the attention module of diffusion models for text-to-video editing. By constraining patches that lie on the same optical-flow path to attend to one another across frames, FLATTEN improves visual consistency. Experimental results show that it maintains visual consistency better than existing methods.
Most recent works extend advanced diffusion models from text-to-image generation to text-to-video editing by inflating spatial self-attention into spatio-temporal self-attention. However, these methods struggle to maintain visual consistency because the inflation process introduces irrelevant information from unrelated patches in other frames.
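The inflation step can be sketched minimally: spatial self-attention lets each frame's patch tokens attend only within that frame, while the inflated version flattens all frames into one long token sequence so every patch attends to every patch in every frame. The function names and tensor shapes below are illustrative, not the paper's code.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention over the token axis.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def spatial_self_attention(x):
    # x: (frames, tokens_per_frame, channels); each frame attends only to itself.
    return attention(x, x, x)

def inflated_spatio_temporal_attention(x):
    # "Inflation": merge all frames into one token sequence so every
    # patch can attend to every patch in every frame.
    f, n, c = x.shape
    tokens = x.reshape(1, f * n, c)
    return attention(tokens, tokens, tokens).reshape(f, n, c)

x = np.random.default_rng(0).normal(size=(4, 16, 8))
out = inflated_spatio_temporal_attention(x)
print(out.shape)  # (4, 16, 8)
```

Because the inflated attention mixes every patch with every other patch, tokens unrelated to a given flow path also contribute, which is the source of the inconsistency the paper targets.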
FLATTEN addresses this challenge by leveraging optical flow to guide attention modules and ensure accurate communication of information across frames. By integrating FLATTEN into existing diffusion-based methods, high-quality and visually consistent text-to-video editing can be achieved without additional training or fine-tuning.
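As an illustrative sketch of the idea (not the paper's implementation), flow-guided attention can be thought of as restricting attention to patches that lie on the same optical-flow trajectory; the `flow_guided_attention` function and its trajectory format here are hypothetical.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention for a 2-D token matrix.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def flow_guided_attention(x, trajectories):
    """x: (frames, patches, channels).
    trajectories: list of [(frame, patch), ...] paths from optical flow.
    Patches on the same flow path attend only to each other."""
    out = x.copy()
    for path in trajectories:
        tokens = np.stack([x[f, p] for f, p in path])  # gather the path
        mixed = attention(tokens, tokens, tokens)      # attend within the path
        for (f, p), t in zip(path, mixed):
            out[f, p] = t
    return out

rng = np.random.default_rng(1)
x = rng.normal(size=(3, 4, 8))
path = [(0, 1), (1, 2), (2, 3)]  # one flow trajectory across 3 frames
y = flow_guided_attention(x, [path])
print(y.shape)  # (3, 4, 8)
```

Patches that are not on any trajectory are left untouched in this sketch; in a real diffusion pipeline the flow-guided step would complement, not replace, the model's existing attention layers.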
The study includes an ablation study to evaluate the contributions of dense spatio-temporal attention (DSTA) and FLATTEN individually and in combination. Results show that combining both modules significantly improves visual consistency in edited videos.
Additionally, user studies confirm that FLATTEN outperforms other methods in terms of semantic alignment, visual consistency, and motion preservation. The method is versatile and can be seamlessly integrated into various diffusion-based text-to-video editing frameworks for enhanced performance.
by Yuren Cong, M... at arxiv.org, 03-04-2024
https://arxiv.org/pdf/2310.05922.pdf