The paper introduces the 3D Lifting Foundation Model (3D-LFM), a novel approach to single-frame 2D-3D lifting that can handle a wide range of object categories. Key highlights:
3D-LFM leverages the permutation equivariance of transformers to process input 2D keypoints without requiring fixed semantic correspondences across its 3D training data. This allows the model to adapt to diverse object categories and keypoint configurations.
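To make the property concrete, here is a minimal sketch (PyTorch; the layer sizes are illustrative, not the paper's): a transformer encoder that uses no index-based positional encoding is permutation-equivariant over its tokens, so 2D keypoints can be fed in any order and the predicted 3D points simply permute with them.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, dropout=0.0,
                                   batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
embed = nn.Linear(2, 64)   # lift each (u, v) keypoint into token space
head = nn.Linear(64, 3)    # regress one 3D point per token

kps_2d = torch.randn(1, 10, 2)   # 10 unordered 2D keypoints
perm = torch.randperm(10)

out = head(encoder(embed(kps_2d)))
out_perm = head(encoder(embed(kps_2d[:, perm])))

# Permuting the inputs permutes the outputs identically.
assert torch.allclose(out[:, perm], out_perm, atol=1e-5)
```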
The integration of Tokenized Positional Encoding (TPE) and a hybrid local-global attention mechanism within the graph-based transformer architecture enhances the model's scalability and ability to handle imbalanced datasets.
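The hybrid local-global attention idea can be illustrated with a masked-attention sketch. This is a hedged reading, not the paper's exact design: the `HybridAttention` module, the k-NN graph construction from 2D distances, and the concatenate-then-project fusion are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class HybridAttention(nn.Module):
    # Sketch only: one branch attends globally over all keypoint tokens,
    # the other is masked to a k-NN graph built from the 2D keypoints.
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, tokens, kps_2d, k=4):
        # Local mask: True blocks attention; keep only each point's k
        # nearest neighbours (self included, since its distance is 0).
        dists = torch.cdist(kps_2d, kps_2d)          # (B, N, N)
        knn = dists.topk(k, largest=False).indices   # (B, N, k)
        mask = torch.ones_like(dists, dtype=torch.bool)
        mask.scatter_(2, knn, False)
        mask = mask.repeat_interleave(self.local_attn.num_heads, dim=0)

        g, _ = self.global_attn(tokens, tokens, tokens)
        l, _ = self.local_attn(tokens, tokens, tokens, attn_mask=mask)
        return self.fuse(torch.cat([g, l], dim=-1))

tokens = torch.randn(2, 10, 64)           # keypoint tokens
kps_2d = torch.randn(2, 10, 2)            # their 2D locations
out = HybridAttention()(tokens, kps_2d)   # (2, 10, 64)
```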
3D-LFM outperforms specialized methods on benchmark datasets like H3WB (Human3.6M WholeBody), demonstrating state-of-the-art performance on human body, face, and hand categories without the need for object-specific designs.
The model exhibits strong generalization capabilities, successfully handling out-of-distribution (OOD) object categories and rig configurations not seen during training, showcasing its potential as a foundational 2D-3D lifting model.
Ablation studies validate the importance of the Procrustean alignment, hybrid attention, and TPE components in enabling 3D-LFM's scalability and OOD generalization.
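For reference, the Procrustean alignment the ablation refers to is, in its classic form, the orthogonal Procrustes solution: align the prediction to the ground truth by an optimal rigid transform before computing the loss, so the network is not penalized for the global-pose ambiguity inherent to 2D-3D lifting. Below is a minimal rotation-plus-translation sketch; the paper's exact variant (e.g. scale handling) may differ.

```python
import torch

def procrustes_align(pred, gt):
    # Align predicted 3D points to ground truth (shapes: (N, 3)) with the
    # optimal rotation and translation (Kabsch / orthogonal Procrustes).
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    u, _, vt = torch.linalg.svd(p.T @ g)   # SVD of the cross-covariance
    s = torch.ones(3)
    s[2] = torch.sign(torch.det(u @ vt))   # avoid reflections
    r = u @ torch.diag(s) @ vt             # optimal rotation
    return p @ r + mu_g

pred, gt = torch.randn(10, 3), torch.randn(10, 3)
loss = torch.nn.functional.mse_loss(procrustes_align(pred, gt), gt)
```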