This work proposes a text-based video generation framework that lets users independently control camera movement and object movement.
The core message of this paper is that strengthening the interaction between spatial and temporal features is crucial for high-quality text-to-video generation. The authors propose a Swapped spatiotemporal Cross-Attention (Swap-CA) mechanism that alternates the "query" role between the spatial and temporal blocks, so that the two feature streams mutually reinforce each other.
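To make the swapping concrete, here is a minimal PyTorch sketch of one pair of cross-attention passes in which the query role alternates between spatial and temporal tokens. The module names, residual connections, head count, and token layout are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Thin wrapper: Q comes from `query`, K/V from `context`."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query, context):
        out, _ = self.attn(query, context, context)
        return out

class SwapCABlock(nn.Module):
    """One swapped pair: spatial queries temporal, then temporal queries spatial."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.spatial_as_query = CrossAttention(dim, heads)
        self.temporal_as_query = CrossAttention(dim, heads)

    def forward(self, spatial_feat, temporal_feat):
        # Pass 1: spatial tokens act as queries over temporal tokens.
        spatial_feat = spatial_feat + self.spatial_as_query(spatial_feat, temporal_feat)
        # Pass 2 (roles swapped): temporal tokens query the updated spatial tokens.
        temporal_feat = temporal_feat + self.temporal_as_query(temporal_feat, spatial_feat)
        return spatial_feat, temporal_feat

if __name__ == "__main__":
    B, N_s, N_t, D = 2, 256, 16, 320   # batch, spatial tokens, temporal tokens, channels (assumed sizes)
    s, t = torch.randn(B, N_s, D), torch.randn(B, N_t, D)
    s2, t2 = SwapCABlock(D)(s, t)
    print(s2.shape, t2.shape)          # (2, 256, 320) and (2, 16, 320)
```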
Cross-attention guidance can enable zero-shot control over object shape, position, and movement in text-to-video diffusion models, despite the limitations of current models.
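As an illustration of how such guidance can be applied at sampling time, the sketch below nudges the current latent so that the cross-attention map of a chosen object token concentrates inside a target spatial region. The `get_token_attention` helper, the loss form, and the guidance scale are assumptions for illustration, not a specific model's API.

```python
import torch

def attention_guidance_step(latent, target_mask, get_token_attention, scale=50.0):
    """
    latent:              (B, C, F, H, W) video latent at the current denoising step
    target_mask:         (H_a, W_a) binary mask marking where the object should appear
    get_token_attention: assumed helper that runs the denoiser and returns the averaged
                         cross-attention map (B, F, H_a, W_a) for the object's text token
    """
    latent = latent.detach().requires_grad_(True)
    attn = get_token_attention(latent)                               # (B, F, H_a, W_a)
    attn = attn / (attn.flatten(2).sum(-1)[..., None, None] + 1e-8)  # normalize per frame
    # Energy: attention mass that falls outside the desired region.
    loss = (attn * (1.0 - target_mask)).sum()
    grad = torch.autograd.grad(loss, latent)[0]
    # Shift the latent against the gradient before continuing the sampler.
    return (latent - scale * grad).detach()
```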
Our novel grid diffusion models efficiently generate high-quality videos from text by reducing the temporal dimension of videos to the image dimension, enabling the use of various image-based methods for video tasks.
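The key operation behind this dimensionality reduction is rearranging a clip's frames into a single grid image (and back), so that an image-space model can process the whole clip at once. The sketch below shows that rearrangement only; the grid size and frame count are arbitrary choices for illustration.

```python
import torch

def frames_to_grid(video, rows, cols):
    """video: (B, F, C, H, W) with F == rows * cols  ->  grid image: (B, C, rows*H, cols*W)"""
    B, F, C, H, W = video.shape
    assert F == rows * cols
    grid = video.reshape(B, rows, cols, C, H, W)
    return grid.permute(0, 3, 1, 4, 2, 5).reshape(B, C, rows * H, cols * W)

def grid_to_frames(grid, rows, cols):
    """grid image: (B, C, rows*H, cols*W)  ->  video: (B, rows*cols, C, H, W)"""
    B, C, GH, GW = grid.shape
    H, W = GH // rows, GW // cols
    video = grid.reshape(B, C, rows, H, cols, W)
    return video.permute(0, 2, 4, 1, 3, 5).reshape(B, rows * cols, C, H, W)

if __name__ == "__main__":
    clip = torch.randn(1, 4, 3, 64, 64)            # 4 frames tiled as a 2x2 grid
    g = frames_to_grid(clip, 2, 2)                 # (1, 3, 128, 128)
    assert torch.allclose(grid_to_frames(g, 2, 2), clip)
```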