The paper introduces Lift3D, a method that lifts any pre-trained 2D vision model so that it produces view-consistent 3D predictions. The key insights are:
The intermediate feature maps of modern 2D vision models are roughly aligned with the input image, so the method only needs to rectify cross-view inconsistencies and propagate predictions from supporting views to novel views to achieve view consistency.
Lift3D builds upon image-based rendering techniques, learning to aggregate pixels along epipolar lines to synthesize novel views. By treating dense feature maps like colors, the method can interpolate novel views directly in the feature space of a pre-trained 2D vision model.
To address the inherent inconsistencies in the 2D feature maps, Lift3D employs a two-stage aggregation strategy: it first corrects the epipolar features obtained from the 2D vision model, using the view-consistent RGB features as a guide, and then aggregates the corrected features across views to render the novel-view feature map (a minimal sketch of this idea follows the list).
Lift3D can be pre-trained on just a few 2D vision models (DINO and CLIP) and then directly applied to unseen scenes and 2D vision operators during inference, demonstrating strong zero-shot generalization abilities.
The authors demonstrate the versatility of Lift3D by applying it to a variety of 3D vision tasks, including semantic segmentation, style transfer, scene editing, open vocabulary segmentation, and image colorization. In many cases, Lift3D outperforms state-of-the-art methods specialized for the specific task.
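The sketch below illustrates the RGB-guided, two-stage correct-then-aggregate idea described above. It is a minimal PyTorch-style illustration under assumed tensor shapes; the module names, dimensions, and architecture are hypothetical and not the authors' implementation.

```python
import torch
import torch.nn as nn


class EpipolarFeatureLifter(nn.Module):
    """Hypothetical sketch: lift 2D features to a novel view via epipolar sampling."""

    def __init__(self, feat_dim: int, rgb_dim: int, hidden: int = 256):
        super().__init__()
        # Stage 1: correct each sampled 2D feature, using the (view-consistent)
        # RGB feature sampled at the same epipolar location as guidance.
        self.correct = nn.Sequential(
            nn.Linear(feat_dim + rgb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )
        # Stage 2: predict blending weights to aggregate the corrected
        # features across source views into one novel-view feature.
        self.weight_head = nn.Sequential(
            nn.Linear(feat_dim + rgb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, epi_feats, epi_rgb):
        # epi_feats: (num_rays, num_views, feat_dim) 2D-model features sampled
        #            along each ray's epipolar lines in the source views.
        # epi_rgb:   (num_rays, num_views, rgb_dim) RGB features sampled at the
        #            same locations, assumed view-consistent.
        guided = torch.cat([epi_feats, epi_rgb], dim=-1)
        corrected = epi_feats + self.correct(guided)          # stage 1: rectify
        weights = torch.softmax(self.weight_head(guided), dim=1)
        return (weights * corrected).sum(dim=1)               # stage 2: aggregate


# Usage: lift DINO-like features (dim 384) guided by RGB features (dim 32)
# for 1024 rays and 8 source views (all sizes are illustrative).
lifter = EpipolarFeatureLifter(feat_dim=384, rgb_dim=32)
novel_feat = lifter(torch.randn(1024, 8, 384), torch.randn(1024, 8, 32))
print(novel_feat.shape)  # torch.Size([1024, 384])
```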
Key insights distilled from the paper by Mukund Varma... (arXiv, 03-29-2024): https://arxiv.org/pdf/2403.18922.pdf