The paper proposes StoryDiffusion, a method for generating subject-consistent image sequences and videos from text prompts. Its two key components are:
Consistent Self-Attention: This training-free, pluggable attention module builds connections across the images within a batch, maintaining consistency of subject identity and attire during image generation. It is inserted into the self-attention layers of pre-trained diffusion models to boost the consistency of the generated images.
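A minimal sketch of the cross-image attention idea is shown below. It assumes standard scaled-dot-product self-attention over per-image token maps; the function name, shapes, and the sampling ratio are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn.functional as F

def consistent_self_attention(q, k, v, sample_ratio=0.5):
    """q, k, v: (batch, tokens, dim) projections from one self-attention layer.

    For each image in the batch, tokens sampled from the *other* images are
    appended to its keys and values, so attention can reference shared
    subject features across the batch without any extra training.
    """
    b, n, d = q.shape
    if b < 2:
        # Nothing to share with a single image; fall back to plain attention.
        return F.scaled_dot_product_attention(q, k, v)
    n_sample = int(n * sample_ratio)
    outputs = []
    for i in range(b):
        # Gather key/value tokens from all other images in the batch.
        others_k = torch.cat([k[j] for j in range(b) if j != i], dim=0)
        others_v = torch.cat([v[j] for j in range(b) if j != i], dim=0)
        # Randomly sample a subset of the cross-image tokens.
        idx = torch.randperm(others_k.shape[0])[: n_sample * (b - 1)]
        k_i = torch.cat([k[i], others_k[idx]], dim=0)  # (n + sampled, d)
        v_i = torch.cat([v[i], others_v[idx]], dim=0)
        # Standard scaled dot-product attention with the augmented K/V.
        out = F.scaled_dot_product_attention(
            q[i].unsqueeze(0), k_i.unsqueeze(0), v_i.unsqueeze(0))
        outputs.append(out.squeeze(0))
    return torch.stack(outputs, dim=0)
```

Because only the keys and values are augmented, the output keeps the same token layout per image, which is why the module can be dropped into an existing diffusion UNet without retraining.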
Semantic Motion Predictor: This module encodes images into a semantic space to capture spatial information and then predicts the motion between a start and an end frame. This enables smooth transition videos with large character movements, a case where previous methods, which rely solely on temporal modeling in the image latent space, tend to fail.
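The sketch below illustrates the prediction step under stated assumptions: a CLIP-style image encoder supplies the start/end embeddings, and a small transformer with learnable per-frame queries predicts the intermediate frame embeddings. The class name, sizes, and number of frames are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SemanticMotionPredictor(nn.Module):
    """Predicts per-frame semantic embeddings between two endpoint frames."""

    def __init__(self, dim=768, num_frames=16, num_layers=4):
        super().__init__()
        self.num_frames = num_frames
        # One learnable query per intermediate frame to be predicted.
        self.frame_queries = nn.Parameter(torch.randn(num_frames, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers)

    def forward(self, start_emb, end_emb):
        """start_emb, end_emb: (batch, dim) semantic embeddings of the start
        and end frames (e.g. from a CLIP image encoder).

        Returns (batch, num_frames, dim) predicted frame embeddings that can
        later condition a video diffusion decoder.
        """
        b, _ = start_emb.shape
        queries = self.frame_queries.unsqueeze(0).expand(b, -1, -1)
        # Condition the frame queries on the two endpoint embeddings.
        tokens = torch.cat(
            [start_emb.unsqueeze(1), queries, end_emb.unsqueeze(1)], dim=1)
        out = self.transformer(tokens)
        # Keep only the positions corresponding to the frame queries.
        return out[:, 1:-1, :]
```

Predicting motion in this semantic space, rather than interpolating image latents frame by frame, is what lets the transitions span large character movements.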
By combining these two novel components, StoryDiffusion can generate consistent image sequences or videos that effectively narrate a story based on text prompts. The qualitative and quantitative evaluations demonstrate the superior performance of StoryDiffusion compared to recent state-of-the-art methods in both consistent image generation and transition video generation.
Source: Yupeng Zhou et al., arXiv, 2024. https://arxiv.org/pdf/2405.01434.pdf