Generating Diverse and Coherent Future Frames by Conditioning on Geometry and Time
Leveraging large-scale pretrained 2D diffusion models, we can efficiently generate diverse and coherent future video frames by conditioning on past context frames and their timestamps.
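One common way to realize such conditioning is to concatenate the past frames channel-wise with a per-frame timestamp embedding before feeding them to the denoiser. The sketch below is purely illustrative — the function names, the sinusoidal embedding, and the channel-concatenation scheme are assumptions for exposition, not the paper's actual architecture:

```python
import numpy as np

def timestamp_embedding(t: float, dim: int = 8) -> np.ndarray:
    """Sinusoidal embedding of a relative frame timestamp (illustrative)."""
    freqs = np.exp(np.linspace(0.0, np.log(1000.0), dim // 2))
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def build_conditioning(context_frames, timestamps, emb_dim: int = 8) -> np.ndarray:
    """Stack past frames channel-wise and append each frame's timestamp
    embedding as constant extra channels, producing one conditioning
    tensor a denoising network could consume alongside the noisy target."""
    h, w = context_frames[0].shape[:2]
    channels = []
    for frame, t in zip(context_frames, timestamps):
        channels.append(frame)                                   # (H, W, C)
        emb = timestamp_embedding(t, emb_dim)                    # (emb_dim,)
        channels.append(np.broadcast_to(emb, (h, w, emb_dim)))   # constant maps
    return np.concatenate(channels, axis=-1)

# Toy usage: two 16x16 RGB context frames at relative times -2 and -1.
frames = [np.random.rand(16, 16, 3) for _ in range(2)]
cond = build_conditioning(frames, timestamps=[-2.0, -1.0])
print(cond.shape)  # (16, 16, 22): 2 frames x (3 image + 8 embedding) channels
```

Making the timestamp an explicit input, rather than a fixed frame index, is what lets a single model handle variable gaps between context and target frames.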