Enhancing Text-to-Video Generation with Swapped Spatiotemporal Attention
The paper's central claim is that strengthening the interaction between spatial and temporal features is crucial for high-quality text-to-video generation. The authors propose a Swapped spatiotemporal Cross-Attention (Swap-CA) mechanism that alternates the "query" role between the spatial and temporal blocks, so that each set of features reinforces the other.
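The idea of alternating the "query" role can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes plain scaled dot-product attention with no learned projections, a residual update after each layer, and that the temporal features take the query role first. The function names `cross_attention` and `swapped_cross_attention` and the parameter `num_layers` are hypothetical and chosen only for illustration.

```python
import math


def cross_attention(queries, keys_values):
    """Scaled dot-product cross-attention without learned projections (assumption).

    queries: list of d-dimensional vectors that play the query role.
    keys_values: list of d-dimensional vectors used as both keys and values.
    Returns one output vector per query.
    """
    d = len(queries[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        # Numerically stable softmax over the keys.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, keys_values))
                        for j in range(d)])
    return outputs


def add_residual(features, update):
    """Element-wise residual addition: features + update."""
    return [[f + u for f, u in zip(fv, uv)] for fv, uv in zip(features, update)]


def swapped_cross_attention(spatial, temporal, num_layers=2):
    """Alternate which feature set is the query at each layer.

    Even layers: temporal features query the spatial features.
    Odd layers: spatial features query the temporal features.
    The starting order is an assumption of this sketch.
    """
    for layer in range(num_layers):
        if layer % 2 == 0:
            temporal = add_residual(temporal, cross_attention(temporal, spatial))
        else:
            spatial = add_residual(spatial, cross_attention(spatial, temporal))
    return spatial, temporal


# Toy example: two spatial tokens and one temporal token, each of dimension 2.
spatial_tokens = [[1.0, 0.0], [0.0, 1.0]]
temporal_tokens = [[0.5, 0.5]]
new_spatial, new_temporal = swapped_cross_attention(spatial_tokens, temporal_tokens)
print(new_spatial, new_temporal)
```

With a single layer only the temporal features are updated. With two layers the spatial features are then updated from the already-refined temporal ones, which is the mutual reinforcement the summary describes. A real model would add learned query, key, and value projections and multiple heads.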