Enhancing Text-to-Video Generation with LLM-Grounded Dynamic Scene Layouts
Large language models (LLMs) can generate dynamic scene layouts (DSLs) that align with complex text prompts. Leveraging this ability, we propose LLM-grounded Video Diffusion (LVD), a training-free pipeline that substantially improves the text-video alignment of existing diffusion-based text-to-video models.
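To make the two-stage flow concrete, below is a minimal Python sketch: an LLM first plans per-frame bounding boxes (the dynamic scene layout), and a frozen text-to-video diffusion model is then steered by those boxes at sampling time, which is why no training is required. All function names, the JSON schema, and the canned LLM response are hypothetical placeholders for illustration, not LVD's actual API.

```python
# Illustrative sketch of the two-stage pipeline; every name and the JSON
# schema below are hypothetical, not the paper's actual implementation.
import json


def plan_dynamic_scene_layout(prompt: str, num_frames: int) -> list[dict]:
    """Stage 1: ask an LLM to plan per-frame bounding boxes (the DSL).

    A real implementation would send `prompt` to a chat LLM along with an
    in-context example of the expected JSON schema; here a canned response
    (a box drifting rightward over time) stands in so the sketch runs.
    """
    canned = json.dumps([
        {"frame": t,
         "boxes": {"cat": [0.1 + 0.04 * t, 0.5, 0.3 + 0.04 * t, 0.8]}}
        for t in range(num_frames)
    ])
    return json.loads(canned)  # one {frame, boxes} record per frame


def sample_with_layout_guidance(prompt: str, layouts: list[dict]) -> None:
    """Stage 2 (stub): condition a frozen text-to-video diffusion model so
    that each object stays inside its planned box in each frame. Guidance
    is applied only at sampling time, keeping the pipeline training-free.
    """
    for record in layouts:
        print(f"frame {record['frame']:2d}: {record['boxes']}")


if __name__ == "__main__":
    prompt = "a cat walking from left to right across a garden"
    layouts = plan_dynamic_scene_layout(prompt, num_frames=8)
    sample_with_layout_guidance(prompt, layouts)
```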