Core Concepts
BIVDiff is a training-free framework that bridges specific image diffusion models and general text-to-video diffusion models to enable flexible and efficient video synthesis for various tasks.
Abstract
The paper proposes a general training-free video synthesis framework called BIVDiff, which bridges specific image diffusion models and general text-to-video diffusion models. The key components are:
Frame-wise Video Generation: BIVDiff first uses a task-specific image diffusion model (e.g., ControlNet, InstructPix2Pix) to generate the target video frame by frame.
Mixed Inversion: To address the distribution shift between image and video diffusion models, BIVDiff introduces Mixed Inversion, which performs DDIM Inversion with both image and video diffusion models and mixes the inverted latents.
Video Temporal Smoothing: Finally, BIVDiff inputs the mixed latents into a video diffusion model (e.g., VidRD, ZeroScope) to achieve temporal smoothing and consistency.
This decoupled framework enables flexible model selection for different synthesis purposes, giving BIVDiff strong task generalization and high efficiency. Extensive experiments on a wide range of video synthesis tasks, including controllable video generation, video editing, video inpainting, and outpainting, demonstrate the effectiveness and general applicability of BIVDiff.
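The three stages above can be sketched as a small pipeline. This is a minimal illustration of the data flow only: the function names, the linear mixing with a ratio `lam`, and the toy stand-in callables are assumptions for exposition, not the paper's actual API or models.

```python
import numpy as np

# Minimal sketch of the three-stage BIVDiff pipeline. Every callable here is
# a hypothetical stand-in; the paper plugs in real diffusion models
# (e.g., ControlNet for stage 1, VidRD or ZeroScope for stage 3).

def frame_wise_generation(frames, image_model):
    """Stage 1: run a task-specific image diffusion model frame by frame."""
    return np.stack([image_model(f) for f in frames])

def mixed_inversion(latents, invert_image, invert_video, lam=0.5):
    """Stage 2: DDIM-invert with both models and mix the inverted latents.
    A linear mix with ratio `lam` is an assumed form of the mixing step."""
    return lam * invert_image(latents) + (1.0 - lam) * invert_video(latents)

def bivdiff(frames, image_model, invert_image, invert_video, video_model,
            lam=0.5):
    edited = frame_wise_generation(frames, image_model)
    mixed = mixed_inversion(edited, invert_image, invert_video, lam)
    return video_model(mixed)  # Stage 3: temporal smoothing by the video model

# Toy stand-ins just to show the flow (not real diffusion models):
frames = np.ones((4, 8, 8))  # 4 frames of 8x8
out = bivdiff(frames,
              image_model=lambda f: f * 2,   # pretend per-frame edit
              invert_image=lambda z: z + 1,  # pretend image-model inversion
              invert_video=lambda z: z - 1,  # pretend video-model inversion
              video_model=lambda z: z)       # pretend temporal smoothing
print(out.shape)  # (4, 8, 8)
```

Because the stages only communicate through latents, any compatible image model and video model can be swapped in, which is what gives the framework its task generalization.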
Motivation
Training video synthesis models requires substantial computing resources.
Even with video foundation models, additional costly training is still required for downstream video synthesis tasks.
Extending image diffusion models to video in a training-free manner struggles to preserve temporal consistency.
Previous methods are usually designed for a single task and lack cross-task generality.
Quotes
"Make him on the road"
"A white swan moving on the lake"