The paper proposes a framework called "Cross-Modal Self-Training" (Cross-MoST) to improve the label-free classification performance of 3D vision models. The key ideas are:
Leverage unlabeled image-pointcloud pairs: The method utilizes unlabeled 3D point cloud data and their accompanying 2D views to learn better 3D representations without requiring any class-level annotations.
Student-teacher framework with joint pseudo-labels: A teacher network generates joint pseudo-labels by combining the predictions of the image and pointcloud branches, and the student is then trained on these joint labels (a minimal sketch follows this list).
Cross-modal feature alignment: In addition to pseudo-label-based training, the method aligns image and pointcloud features so that the two branches learn complementary, mutually consistent representations (see the second sketch below).
Masked modeling: The framework incorporates masked image and pointcloud modeling objectives to learn rich local features, which are then aligned with the global object-level representations (see the third sketch below).
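How the joint pseudo-labels are formed is the heart of the self-training idea. A common recipe, assumed in the PyTorch sketch below, is to average the two branches' teacher softmax outputs and keep only confident predictions, FixMatch-style; the threshold value and the averaging rule are illustrative choices, not necessarily the paper's exact ones.

```python
import torch.nn.functional as F

def joint_pseudo_labels(teacher_logits_img, teacher_logits_pc, threshold=0.7):
    """Fuse teacher predictions from the image and pointcloud branches
    into one pseudo-label per object (illustrative averaging rule)."""
    probs = (F.softmax(teacher_logits_img, dim=-1)
             + F.softmax(teacher_logits_pc, dim=-1)) / 2
    confidence, labels = probs.max(dim=-1)
    return labels, confidence >= threshold   # boolean mask of confident samples

def self_training_loss(student_logits_img, student_logits_pc, labels, mask):
    """Cross-entropy on confident joint pseudo-labels, applied to the
    student's image and pointcloud branches alike."""
    ce = (F.cross_entropy(student_logits_img, labels, reduction="none")
          + F.cross_entropy(student_logits_pc, labels, reduction="none"))
    return (ce * mask).sum() / mask.sum().clamp(min=1)
```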
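Cross-modal feature alignment of this kind is typically implemented as a symmetric InfoNCE objective over L2-normalized global features, as popularized by CLIP; the sketch below assumes that formulation rather than reproducing the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def cross_modal_alignment_loss(feat_img, feat_pc, temperature=0.07):
    """Symmetric InfoNCE over paired (B, D) image/pointcloud features;
    matched pairs sit on the diagonal of the similarity matrix."""
    img = F.normalize(feat_img, dim=-1)
    pc = F.normalize(feat_pc, dim=-1)
    logits = img @ pc.t() / temperature        # (B, B) cosine similarities
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```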
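Masked modeling begins by hiding a random subset of patch tokens, whether image patches or pointcloud groups, so the network must infer local structure from context. The masking utility below is a generic sketch; the 0.6 mask ratio is an assumed default, not the paper's reported setting.

```python
import torch

def random_token_mask(tokens, mask_ratio=0.6):
    """Randomly mask a fraction of patch/group tokens of shape (B, N, D).
    Returns the visible tokens and a boolean mask (True = masked), which
    a decoder can use to reconstruct the hidden positions."""
    B, N, D = tokens.shape
    num_mask = int(N * mask_ratio)
    order = torch.rand(B, N, device=tokens.device).argsort(dim=1)
    mask = torch.zeros(B, N, dtype=torch.bool, device=tokens.device)
    mask.scatter_(1, order[:, :num_mask], True)    # mark masked positions
    visible = tokens[~mask].reshape(B, N - num_mask, D)
    return visible, mask
```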
The proposed Cross-MoST framework is evaluated on several synthetic and real-world 3D datasets, demonstrating significant gains over zero-shot 3D classification baselines as well as over self-training on either modality alone. By exploiting the complementary strengths of the image and pointcloud modalities, the method improves classification accuracy in a fully label-free setting.
Source: Amaya Dharma..., arxiv.org, 04-17-2024, https://arxiv.org/pdf/2404.10146.pdf