The study investigates how well Stable Diffusion and other large-scale models capture the physical properties of 3D scenes. It introduces a probing protocol that measures how effectively a model's features predict scene attributes such as geometry, material, lighting, and occlusion. Results show that Stable Diffusion does well on properties such as scene geometry and support relations but is weaker on occlusion and material, which points to both potential applications and limitations in scene analysis.
Recent advances in generative models have markedly improved image quality. The study assesses how well diffusion networks model 3D scenes by testing whether their features encode a range of scene properties. To do so, the researchers train discriminative classifiers on diffusion features and probe attributes such as geometry, material, support relations, lighting, and depth.
The investigation finds that Stable Diffusion features discriminate well for properties such as scene geometry and support relations, but perform worse on occlusion and material prediction. The study also applies the same evaluation to other large-scale networks (DINOv2, OpenCLIP, CLIP, and VQGAN) and compares them against Stable Diffusion.
The main metric is the ROC AUC score, obtained by linear probing of features extracted from different layers and time steps of the models. The results show where each model is strong and where it is weak across the physical properties of 3D scenes.
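As an illustration of linear probing scored by ROC AUC, the following is a minimal sketch rather than the study's actual pipeline. It assumes the backbone features are already extracted and frozen; here they are replaced by synthetic vectors from a hypothetical helper, `make_synthetic_features`, and the probe is a plain logistic regression, which is an assumed stand-in for whatever linear classifier the study used. All function names and hyperparameters are illustrative.

```python
import math
import random


def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(features, labels, lr=0.1, epochs=200):
    """Fit weights w and bias b by full-batch gradient descent on log loss.

    Only the linear layer is trained; the features themselves are fixed.
    """
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_scores(features, w, b):
    """Linear decision scores; a higher score means 'more likely positive'."""
    return [sum(wi * xi for wi, xi in zip(w, x)) + b for x in features]


def make_synthetic_features(n, dim, signal, rng):
    """Hypothetical stand-in for features from one layer and time step.

    Labels alternate 0/1; `signal` shifts the first dimension by class,
    so signal=0 means the property is not linearly readable at all.
    """
    feats, labels = [], []
    for i in range(n):
        y = i % 2
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += signal if y == 1 else -signal
        feats.append(x)
        labels.append(y)
    return feats, labels


def evaluate_probe(signal, seed=0, n=200, dim=8):
    """Train a probe on one split and report ROC AUC on a held-out split."""
    rng = random.Random(seed)
    train_x, train_y = make_synthetic_features(n, dim, signal, rng)
    test_x, test_y = make_synthetic_features(n, dim, signal, rng)
    w, b = train_linear_probe(train_x, train_y)
    return roc_auc(test_y, probe_scores(test_x, w, b))
```

Calling `evaluate_probe(signal=2.0)` and `evaluate_probe(signal=0.0)` contrasts a property that is linearly readable from the features with one that is not: the first should give an AUC close to 1, the second an AUC near chance (0.5). In a real setting, one such score would be computed per property, per layer, and per time step, which is what allows models to be compared.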
Source: arxiv.org