The paper introduces TrailBlazer, a method for enhancing controllability in diffusion-based text-to-video (T2V) synthesis. The key contributions are:
Novelty: TrailBlazer uses high-level bounding boxes to guide the subject's position, size, and behavior in the synthesized video, avoiding the need for low-level control signals like edge maps or detailed masks.
Position, size, and prompt trajectory control: Users can control the subject's position by keyframing bounding boxes, adjust the subject's size to produce perspective effects, and keyframe the text prompt to influence the subject's behavior and identity (a box-interpolation sketch follows this list).
Simplicity: TrailBlazer operates by directly editing the spatial and temporal attention in a pre-trained diffusion model, requiring no training or optimization.
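To make the keyframing idea concrete, here is a minimal sketch of how sparse user-specified bounding boxes could be expanded into a per-frame trajectory by linear interpolation. The `BBox` type, the keyframe dictionary format, and the `interpolate_bboxes` helper are illustrative assumptions, not TrailBlazer's actual API.

```python
from dataclasses import dataclass

@dataclass
class BBox:
    """Normalized bounding box in [0, 1] image coordinates (hypothetical format)."""
    left: float
    top: float
    right: float
    bottom: float

def lerp(a: float, b: float, t: float) -> float:
    """Linear interpolation between a and b."""
    return a + (b - a) * t

def interpolate_bboxes(keyframes: dict[int, BBox], num_frames: int) -> list[BBox]:
    """Expand sparse {frame_index: BBox} keyframes into one box per frame.

    Between consecutive keyframes the box corners are linearly interpolated,
    giving smooth position changes; interpolating the box size over time
    yields simple perspective-like effects (the subject grows as it approaches).
    """
    frames = sorted(keyframes)
    boxes = []
    for f in range(num_frames):
        if f <= frames[0]:
            boxes.append(keyframes[frames[0]])
            continue
        if f >= frames[-1]:
            boxes.append(keyframes[frames[-1]])
            continue
        # Find the keyframes surrounding frame f.
        lo = max(k for k in frames if k <= f)
        hi = min(k for k in frames if k >= f)
        t = 0.0 if hi == lo else (f - lo) / (hi - lo)
        a, b = keyframes[lo], keyframes[hi]
        boxes.append(BBox(lerp(a.left, b.left, t), lerp(a.top, b.top, t),
                          lerp(a.right, b.right, t), lerp(a.bottom, b.bottom, t)))
    return boxes

# Example: a subject moving left-to-right while growing (approaching the camera).
trajectory = interpolate_bboxes(
    {0: BBox(0.05, 0.4, 0.25, 0.6), 23: BBox(0.6, 0.2, 1.0, 0.8)},
    num_frames=24,
)
```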
The method builds upon a pre-trained T2V model, ZeroScope, and introduces spatial and temporal attention editing to guide the subject's trajectory and appearance. Spatial attention is edited within the user-specified bounding box, while temporal attention is edited to control the subject's movement and size changes over time. The resulting videos exhibit natural motion, perspective effects, and seamless integration of the subject within the environment.
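A minimal sketch of the spatial-attention editing idea, assuming a cross-attention map of shape (heads, H*W, num_tokens): weights for the prompt tokens naming the subject are strengthened inside the current frame's box and attenuated outside, then renormalized. The function name, the `strength` scale, and the renormalization step are assumptions for illustration, not the paper's exact procedure; temporal attention could be biased analogously across frame indices to shape motion.

```python
import torch

def edit_spatial_attention(
    attn: torch.Tensor,                       # (heads, H*W, num_tokens) attention weights
    subject_token_ids: list[int],             # prompt tokens naming the subject
    bbox: tuple[float, float, float, float],  # (left, top, right, bottom) in [0, 1]
    h: int,
    w: int,
    strength: float = 2.0,                    # illustrative guidance scale (assumed)
) -> torch.Tensor:
    """Bias cross-attention so subject tokens attend inside the box (sketch only).

    Builds a binary mask over the h x w latent grid from the bounding box,
    scales subject-token attention up inside the box and down outside,
    then renormalizes each spatial query's distribution over tokens.
    """
    left, top, right, bottom = bbox
    mask = torch.zeros(h, w)
    mask[int(top * h):int(bottom * h), int(left * w):int(right * w)] = 1.0
    mask = mask.flatten()  # (H*W,)

    edited = attn.clone()
    inside = mask.unsqueeze(0)  # (1, H*W), broadcasts over heads
    for tok in subject_token_ids:
        # Amplify attention inside the box, attenuate it outside.
        scale = inside * strength + (1.0 - inside) / strength
        edited[:, :, tok] = edited[:, :, tok] * scale
    # Renormalize so each spatial query's weights still sum to one.
    edited = edited / edited.sum(dim=-1, keepdim=True)
    return edited
```

Because the bias is applied to attention maps of an already-trained model at inference time, this style of edit needs no fine-tuning, which is consistent with the training-free property described above.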
The paper presents qualitative and quantitative evaluations, comparing TrailBlazer to existing methods like T2V-Zero and Peekaboo. The results demonstrate TrailBlazer's ability to produce high-quality, controllable video synthesis without the need for low-level guidance or additional training.