The authors propose a simple yet effective approach to text-to-video generation based on grid diffusion models. The method consists of two main stages: (1) key grid image generation and (2) autoregressive grid image interpolation.
In the first stage, the key grid image generation model produces a key grid image that represents the video described by the given text. The key grid image contains four frames arranged in a grid layout, capturing the primary motions or events of the video. The authors fine-tune a pre-trained text-to-image model (Stable Diffusion) to generate this key grid image.
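As a rough illustration of this first stage, the sketch below renders a single image with an off-the-shelf Stable Diffusion checkpoint and slices it into four frames as if it were a 2x2 key grid. The checkpoint name, prompt, resolution, and 2x2 layout are illustrative assumptions, not the authors' released model, which would be a grid-specific fine-tuned checkpoint.

```python
# Minimal sketch: a fine-tuned key-grid checkpoint would replace the base model below.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed base text-to-image model
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a dog running on the beach"  # text describing the whole video

# Generate one 512x512 image and treat it as a 2x2 grid of 256x256 key frames.
grid = pipe(prompt, height=512, width=512).images[0]
key_frames = [
    grid.crop((x, y, x + 256, y + 256))
    for y in (0, 256)
    for x in (0, 256)
]
```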
In the second stage, the autoregressive grid image interpolation model generates the output grid image conditioned on the previously generated grid image and the masked input grid image. This approach allows the model to maintain temporal consistency and generate videos with more than 28 frames.
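The sketch below illustrates the general shape of this second stage: at each step an input grid is assembled in which known frames are kept and the remaining cells are masked, and an interpolation model fills in the masked cells conditioned on the previously generated grid. The `model` callable, the `make_grid`/`split_grid` helpers, and the specific masking pattern are hypothetical stand-ins for the paper's fine-tuned interpolation model, not its actual interface.

```python
import numpy as np

def make_grid(cells):
    """Arrange four HxWx3 frames into a single 2x2 grid image."""
    top = np.concatenate(cells[:2], axis=1)
    bottom = np.concatenate(cells[2:], axis=1)
    return np.concatenate([top, bottom], axis=0)

def split_grid(grid):
    """Inverse of make_grid: cut a 2x2 grid back into four frames."""
    h, w = grid.shape[0] // 2, grid.shape[1] // 2
    return [grid[:h, :w], grid[:h, w:], grid[h:, :w], grid[h:, w:]]

def autoregressive_interpolation(key_frames, model, steps=8):
    """Sketch of the autoregressive stage.

    `model(prev_grid, masked_grid)` is a hypothetical callable standing in for
    the fine-tuned grid-interpolation diffusion model; it returns a completed
    2x2 grid image of the same shape as its inputs.
    """
    frames = list(key_frames)          # frames as HxWx3 float arrays
    prev_grid = make_grid(frames[:4])  # the key grid from stage one

    for _ in range(steps):
        # Illustrative masking: keep the most recent frame in the top-left
        # cell and zero out the other three cells for the model to fill in.
        blank = np.zeros_like(frames[-1])
        masked = make_grid([frames[-1], blank, blank, blank])

        out_grid = model(prev_grid, masked)
        frames.extend(split_grid(out_grid)[1:])  # three newly generated frames
        prev_grid = out_grid
    return frames
```

Because each step conditions on the grid produced in the previous step, new frames stay anchored to already-generated content, which is how the method keeps temporal consistency while extending the video length.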
The authors also explore extensions of their method, such as text-guided video manipulation and high-resolution video generation. Experimental results show that the proposed model outperforms existing text-to-video generation models in both quantitative and qualitative evaluations, while being more efficient in terms of GPU memory usage and requiring less training data.