The paper proposes BIVDiff, a general training-free video synthesis framework that bridges task-specific image diffusion models and general text-to-video diffusion models. The key components are:
Frame-wise Video Generation: BIVDiff first uses a task-specific image diffusion model (e.g., ControlNet, InstructPix2Pix) to generate the target video frame by frame.
Mixed Inversion: To address the gap between the latent distributions of the image and video diffusion models, BIVDiff introduces Mixed Inversion, which performs DDIM inversion with both models and mixes the resulting inverted latents.
Video Temporal Smoothing: Finally, BIVDiff feeds the mixed latents into a video diffusion model (e.g., VidRD, ZeroScope) for temporal smoothing, producing a temporally consistent video (a sketch of the full pipeline follows this list).
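The following minimal Python sketch illustrates how the three stages could be composed. It is not the paper's implementation: the `image_model` and `video_model` wrappers, the `ddim_invert` helper, and the mixing ratio `gamma` are assumed, illustrative interfaces rather than the authors' API.

```python
import torch

def bivdiff_pipeline(frames, prompt, image_model, video_model,
                     ddim_invert, gamma=0.5, num_steps=50):
    """Sketch of the three-stage BIVDiff pipeline (hypothetical interfaces).

    frames      : input video frames, shape (F, C, H, W)
    image_model : task-specific image diffusion model (e.g., a ControlNet wrapper)
    video_model : text-to-video diffusion model (e.g., a VidRD/ZeroScope wrapper)
    ddim_invert : callable(model, latents, prompt, steps) -> inverted noise latents
    gamma       : mixing ratio between the two inverted latents (assumed parameter)
    """
    # 1) Frame-wise video generation: run the image diffusion model per frame.
    edited_frames = torch.stack(
        [image_model.generate(frame, prompt) for frame in frames]
    )

    # Encode the frame-wise result into latent space.
    latents = image_model.encode(edited_frames)  # (F, c, h, w)

    # 2) Mixed Inversion: DDIM-invert the latents with both models and mix them
    #    to bridge the distribution gap between image and video diffusion models.
    z_img = ddim_invert(image_model, latents, prompt, num_steps)
    z_vid = ddim_invert(video_model, latents, prompt, num_steps)
    z_mix = gamma * z_img + (1.0 - gamma) * z_vid

    # 3) Video temporal smoothing: denoise the mixed latents with the video
    #    diffusion model to obtain a temporally consistent video.
    smoothed_latents = video_model.denoise(z_mix, prompt, num_steps)
    return video_model.decode(smoothed_latents)
```

The only coupling between the two models in this sketch is the latent tensor passed between stages, which is what makes the choice of image and video models interchangeable.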
This decoupled framework enables flexible model selection for different synthesis purposes, endowing BIVDiff with strong task generalization and high efficiency. Extensive experiments on a wide range of video synthesis tasks, including controllable video generation, video editing, video inpainting, and video outpainting, demonstrate the effectiveness and generality of BIVDiff.
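To make the model-selection point concrete, here is a hypothetical usage of the sketch above: swapping the image-side model switches the task while the video-side model handles temporal consistency. The wrapper variables are illustrative placeholders, not real classes, and only models named in this summary are referenced.

```python
# Controllable video generation: ControlNet (image side) + VidRD (video side).
video_a = bivdiff_pipeline(frames, "a dog running on the beach",
                           image_model=controlnet_wrapper,
                           video_model=vidrd_wrapper,
                           ddim_invert=ddim_invert)

# Text-guided video editing: InstructPix2Pix (image side) + ZeroScope (video side).
video_b = bivdiff_pipeline(frames, "make it look like a watercolor painting",
                           image_model=instruct_pix2pix_wrapper,
                           video_model=zeroscope_wrapper,
                           ddim_invert=ddim_invert)
```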
Key insights extracted from Fengyuan Shi... at arxiv.org (04-10-2024): https://arxiv.org/pdf/2312.02813.pdf