Core Concept
MagicTime is a novel framework for generating high-quality metamorphic time-lapse videos that accurately depict real-world physical phenomena.
Summary
The paper introduces MagicTime, a framework for generating metamorphic time-lapse videos that capture an object's complete transformation process. By contrast, videos produced by existing text-to-video (T2V) models show only limited motion and variation.
Key highlights:
- Metamorphic videos encode more comprehensive physical knowledge compared to general videos, making them challenging to generate.
- The authors propose the MagicAdapter scheme to decouple spatial and temporal training, allowing the model to better encode physical knowledge from metamorphic videos.
- A Dynamic Frames Extraction strategy is introduced to suit metamorphic time-lapse videos, which span a wider range of variation and capture dramatic transformations of objects.
- A Magic Text-Encoder is developed to improve the model's understanding of metamorphic video prompts.
- The authors curate a high-quality dataset called ChronoMagic, consisting of 2,265 time-lapse videos with detailed captions, to enable the training of metamorphic video generation models.
- Extensive experiments demonstrate the superiority of MagicTime in generating high-quality and consistent metamorphic time-lapse videos.
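The summary does not spell out how Dynamic Frames Extraction works. The Python sketch below assumes its core idea is to sample training frames spread across the entire time-lapse instead of taking a short contiguous clip. It contrasts the two strategies. The helper names `consecutive_indices` and `spread_indices` are hypothetical and are not taken from the MagicTime code.

```python
def consecutive_indices(total_frames: int, num_samples: int, start: int = 0) -> list[int]:
    """Hypothetical baseline: a short contiguous clip beginning at `start`."""
    if num_samples < 1 or start < 0 or start + num_samples > total_frames:
        raise ValueError("clip does not fit inside the video")
    return list(range(start, start + num_samples))


def spread_indices(total_frames: int, num_samples: int) -> list[int]:
    """Hypothetical sketch: spread samples over the whole video.

    Returns `num_samples` frame indices running from the first frame (0) to the
    last frame (total_frames - 1), evenly spaced up to integer flooring, so the
    sampled clip covers the full metamorphosis rather than a small slice of it.
    """
    if total_frames < 1 or num_samples < 1:
        raise ValueError("total_frames and num_samples must be positive")
    if num_samples > total_frames:
        raise ValueError("cannot sample more frames than the video contains")
    if num_samples == 1:
        return [0]
    # Integer arithmetic avoids floating-point drift and makes the last index exact.
    return [i * (total_frames - 1) // (num_samples - 1) for i in range(num_samples)]


if __name__ == "__main__":
    print(consecutive_indices(300, 8))  # [0, 1, 2, 3, 4, 5, 6, 7]
    print(spread_indices(300, 8))       # [0, 42, 85, 128, 170, 213, 256, 299]
```

Take a 300-frame time-lapse of a flower blooming. The contiguous clip sees under 3% of the footage, while the spread sample reaches from the first frame to the last. The paper's actual strategy may include details that this sketch does not model.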
Statistics
"Recent advances in Text-to-Video (T2V) generation models have been driven by the emergence of diffusion models."
"Compared to general videos, we have observed a category of videos that typically encompasses the subject's entire transformation process, thus addressing the inherent limitations of the former. We term this category as metamorphic videos, which encode a more comprehensive representation of world knowledge."
"Time-lapse videos provide detailed documentation of an object's complete metamorphosis, possessing the essential characteristics of metamorphic videos."
Quotes
"Compared to general videos, metamorphic videos contain physical knowledge, long persistence, and strong variation, making them difficult to generate."
"We introduce MagicTime, a metamorphic time-lapse video generation diffusion model by adopting a standard T2V model with a MagicAdapter scheme. This addresses the limitations of existing methods that are unable to generate metamorphic videos."
"We propose an automatic metamorphic video captioning annotation framework. Utilizing this framework, we have curated a high-quality dataset named ChronoMagic, consisting of 2,265 time-lapse videos, each accompanied by a detailed caption."