The paper presents a novel method for customizing motion when generating videos from text prompts, addressing the underexplored aspect of motion representation. The key contributions are:
Motion Embeddings: The authors introduce a set of temporally coherent, one-dimensional embeddings, termed Motion Embeddings, for explicit and efficient motion encoding. These embeddings are integrated into the temporal transformer modules of video diffusion models, where they directly modulate the self-attention computations across frames (a sketch of this integration follows the contribution list).
Temporal Discrepancy: The authors identify a phenomenon they term Temporal Discrepancy: different motion modules in video generative models handle the temporal relationships between frames in different ways. This insight is leveraged to optimize how the motion embeddings are integrated (a second sketch after the list illustrates one way such per-module differences could be quantified).
Experiments: Extensive experiments are conducted, demonstrating the effectiveness and flexibility of the proposed approach. The method outperforms existing motion transfer techniques in preserving the original video's motion trajectory and generating visual features that align with the provided text prompts.
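A minimal PyTorch sketch of the integration described in the first contribution, assuming the common video-diffusion layout where spatial positions are folded into the batch and self-attention runs only along the frame axis. Class and parameter names here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class TemporalAttentionWithMotionEmbedding(nn.Module):
    """Illustrative temporal self-attention block modulated by a learned
    one-dimensional (per-frame) motion embedding. Names and layout are
    assumptions for exposition, not the paper's code."""

    def __init__(self, dim: int, num_frames: int, num_heads: int = 8):
        super().__init__()
        # One learnable vector per frame: shape (num_frames, dim).
        self.motion_embedding = nn.Parameter(torch.zeros(num_frames, dim))
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * spatial_positions, num_frames, dim) -- spatial positions
        # are folded into the batch so attention mixes information across frames.
        h = self.norm(x)
        # Adding the per-frame embedding before attention steers how each frame
        # attends to the others, i.e. it modulates the temporal self-attention
        # pattern rather than the spatial content.
        h = h + self.motion_embedding.unsqueeze(0)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out
```

In a sketch like this, motion customization would amount to optimizing only `motion_embedding` against the reference video while the pretrained diffusion weights stay frozen, which is what makes the encoding explicit and lightweight.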
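The summary does not spell out how Temporal Discrepancy is measured or exploited. As a loose illustration only, the sketch below scores how frame-specific each temporal attention module's behaviour is; under this hypothetical diagnostic, motion embeddings might be injected preferentially into modules with strongly frame-dependent attention. The function and module names are assumptions, and the paper's actual criterion may differ.

```python
import torch


def frame_sensitivity(attn_maps: dict) -> dict:
    """Hypothetical per-module diagnostic for the Temporal Discrepancy idea.

    `attn_maps` maps a module name to its attention weights averaged over
    heads and spatial positions, shape (num_frames, num_frames), rows summing
    to one. Rows close to uniform mean the module mixes frames
    indiscriminately; a larger score indicates more frame-dependent behaviour.
    """
    scores = {}
    for name, attn in attn_maps.items():
        num_frames = attn.shape[-1]
        uniform = torch.full_like(attn, 1.0 / num_frames)
        # Mean absolute deviation from a frame-agnostic (uniform) pattern.
        scores[name] = (attn - uniform).abs().mean().item()
    return scores


# Toy usage: one module attends mostly to the same frame, the other averages
# all frames equally.
diag_heavy = torch.eye(4) * 0.7 + torch.full((4, 4), 0.3 / 4)
flat = torch.full((4, 4), 1.0 / 4)
print(frame_sensitivity({"temporal_block_0": diag_heavy, "temporal_block_1": flat}))
```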
The paper highlights the importance of motion representation in video generation and presents a novel solution that enables more creative, customized, and controllable video synthesis.
Key insights distilled from the source content by Luozhou Wang... at arxiv.org, 04-01-2024: https://arxiv.org/pdf/2403.20193.pdf