The paper aims to explore the extent of FVD's bias toward per-frame quality over temporal realism and to identify its sources. The authors first quantify FVD's sensitivity to the temporal axis by decoupling frame quality from motion quality. They find that FVD increases only slightly even under large temporal corruption, suggesting a bias toward the quality of individual frames.
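To make the setup concrete, here is a minimal sketch of such a sensitivity test: FVD is the Fréchet distance between Gaussians fitted to video-level features of the real and generated sets, and one simple temporal corruption that leaves per-frame quality untouched is shuffling the frame order within each video. The function names, array shapes, and the shuffling choice below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two feature sets.

    feats_*: (N, D) arrays of video-level features (e.g., from a video encoder)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

def shuffle_frames(videos, seed=None):
    """Temporal corruption: permute the frame order within each video,
    keeping per-frame content (and thus per-frame quality) untouched.

    videos: (N, T, H, W, C) array."""
    rng = np.random.default_rng(seed)
    out = videos.copy()
    for i in range(len(videos)):
        out[i] = videos[i][rng.permutation(videos.shape[1])]
    return out
```

If FVD were sensitive to temporal realism, computing it on frame-shuffled videos against the real set should raise the score sharply; the paper's observation is that the increase is comparatively small.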
The authors then analyze generated videos and show that, by carefully sampling from a large set of generated videos that contain no motion, one can drastically decrease FVD without improving temporal quality. This further confirms FVD's bias toward image quality.
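The subset-selection idea can be sketched as a greedy search over a candidate pool: repeatedly add the static video whose inclusion most lowers the Fréchet distance to the real features. This is a simplified illustration reusing the `frechet_distance` helper above and assuming precomputed (N, D) feature arrays; the paper's actual sampling procedure may differ.

```python
def greedy_min_fvd_subset(pool_feats, real_feats, k):
    """Greedily pick k candidates from a pool of (motion-free) generated
    videos so that the FVD of the selected subset against the real set is
    (locally) minimized.

    pool_feats: (M, D) features of candidate generated videos.
    real_feats: (N, D) features of real videos."""
    chosen = []
    remaining = list(range(len(pool_feats)))
    for _ in range(k):
        best_idx, best_fvd = None, np.inf
        for idx in remaining:
            cand = pool_feats[chosen + [idx]]
            if len(cand) < 2:
                # covariance is undefined for a single sample; fall back to mean distance
                fvd = float(np.sum((cand.mean(0) - real_feats.mean(0)) ** 2))
            else:
                fvd = frechet_distance(real_feats, cand)
            if fvd < best_fvd:
                best_idx, best_fvd = idx, fvd
        chosen.append(best_idx)
        remaining.remove(best_idx)
    return chosen
```

The point of the experiment is that such a selection can push FVD down substantially even though none of the chosen videos exhibits any motion.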
The authors attribute this bias to the features extracted from a supervised video classifier trained on a content-biased dataset. They show that FVD computed with features from recent large-scale self-supervised video models is less biased toward image quality.
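Since only the feature extractor changes, the comparison amounts to recomputing the same Fréchet distance with different backbones on identical video sets. The sketch below reuses `frechet_distance` and assumes generic `videos -> (N, D)` callables for the supervised (I3D-style) and self-supervised (e.g., VideoMAE-style) encoders; the callable names are placeholders, not a specific library API.

```python
def compare_backbones(real_videos, gen_videos, backbones):
    """Recompute FVD with different feature extractors on the same videos.

    backbones: dict mapping a name to a callable `videos -> (N, D)` features,
    e.g. {"i3d_supervised": ..., "videomae_ssl": ...} (placeholders)."""
    return {
        name: frechet_distance(extract(real_videos), extract(gen_videos))
        for name, extract in backbones.items()
    }
```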
Finally, the authors revisit a few real-world examples to validate their hypothesis. They find that FVD fails to capture temporal inconsistencies in long video generation, whereas FVD computed with features from self-supervised models aligns better with human perception.
Source: Songwei Ge et al., https://arxiv.org/pdf/2404.12391.pdf (arxiv.org, 2024-04-19)