The paper proposes StoryDiffusion, a method for generating consistent images and videos from text prompts. Its key components are:
Consistent Self-Attention: This training-free and pluggable attention module builds connections across images within a batch to maintain consistency in subject identity and attire during image generation. It is incorporated into pre-trained diffusion models to boost the consistency of the generated images.
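The core idea, sharing attention across a batch so every image attends to tokens sampled from its sibling images, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the uniform sampling scheme, and the single-head attention are all simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def consistent_self_attention(features, sample_ratio=0.5, seed=0):
    """Sketch of batch-wide attention sharing (assumed simplification).

    features: (B, N, D) token features for B images generated in one batch.
    Each image attends over its own N tokens plus tokens randomly sampled
    from the other images, which is what ties subject identity together
    without any training.
    """
    B, N, D = features.shape
    rng = np.random.default_rng(seed)
    out = np.empty_like(features)
    for i in range(B):
        # Gather tokens from the other images in the batch.
        others = features[np.arange(B) != i].reshape(-1, D)
        k = int(sample_ratio * others.shape[0])
        idx = rng.choice(others.shape[0], size=k, replace=False)
        # Keys/values = own tokens + sampled cross-image tokens.
        kv = np.concatenate([features[i], others[idx]], axis=0)   # (N+k, D)
        attn = softmax(features[i] @ kv.T / np.sqrt(D))           # (N, N+k)
        out[i] = attn @ kv
    return out
```

Because the module only changes which keys and values the existing self-attention layers see, it can be plugged into a pre-trained diffusion model without retraining, which is the property the paper emphasizes.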
Semantic Motion Predictor: This module encodes images into a semantic space to capture spatial information and then predicts the motion between a start and end frame. This enables the generation of smooth transition videos with large character movements, which is a limitation of previous methods that rely solely on temporal modeling in the image latent space.
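The data flow of the predictor, mapping a start and an end frame into a semantic space and producing intermediate frame embeddings, can be illustrated with a toy stand-in. The paper trains a transformer-based predictor; the linear interpolation below is only an assumed placeholder showing the interface, and the function name is hypothetical.

```python
import numpy as np

def predict_transition(start_feat, end_feat, num_frames):
    """Toy stand-in for the Semantic Motion Predictor.

    start_feat, end_feat: (D,) embeddings of the start/end frames in a
    semantic space (the paper obtains these from an image encoder).
    Returns (num_frames, D) embeddings for the transition frames.
    Linear interpolation here merely illustrates the input/output shape;
    the actual module is a learned predictor.
    """
    ts = np.linspace(0.0, 1.0, num_frames)[:, None]        # (F, 1)
    return (1 - ts) * start_feat[None] + ts * end_feat[None]  # (F, D)
```

Predicting motion in a semantic space rather than the image latent space is what lets the method handle large character movements between keyframes; the predicted embeddings are then decoded into video frames.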
By combining these two novel components, StoryDiffusion can generate consistent image sequences or videos that effectively narrate a story based on text prompts. The qualitative and quantitative evaluations demonstrate the superior performance of StoryDiffusion compared to recent state-of-the-art methods in both consistent image generation and transition video generation.
by Yupeng Zhou, ... at arxiv.org, 05-03-2024
https://arxiv.org/pdf/2405.01434.pdf