In this paper, a new data-efficient pre-training method for event cameras is introduced, focusing on voxel-based self-supervised learning. The approach addresses the challenges of sparse and non-uniform event data by utilizing semantic-uniform masking and disentangled masked modeling. By decomposing the reconstruction task into local spatio-temporal details and global semantics, the proposed method achieves superior generalization performance across various tasks with fewer parameters and lower computational costs.
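The decomposition of the reconstruction target into local spatio-temporal details and global semantics can be pictured as two loss terms. Below is a minimal sketch, not the paper's actual objective: the choice of MSE for the local branch and cosine distance for the global branch, and the equal weighting, are assumptions for illustration.

```python
import numpy as np

def disentangled_losses(pred_local, target_local, pred_global, target_global):
    """Combine a local-detail reconstruction loss with a global-semantics loss.

    Hypothetical sketch: MSE over masked-voxel details plus cosine distance
    between a predicted and a target global feature. The paper's actual
    objectives and weighting may differ.
    """
    # Local branch: reconstruct fine spatio-temporal detail (MSE).
    local = float(np.mean((pred_local - target_local) ** 2))
    # Global branch: match the direction of a global semantic feature.
    cos = np.sum(pred_global * target_global) / (
        np.linalg.norm(pred_global) * np.linalg.norm(target_global) + 1e-9)
    global_ = float(1.0 - cos)
    return local + global_, local, global_
```

Keeping the two terms separate lets each branch specialize, which is the intuition behind disentangling detail reconstruction from semantic prediction.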
The paper first reviews the benefits of event cameras over traditional frame-based cameras: low latency, high dynamic range, and low power consumption. It identifies the scarcity of labeled event data as a key bottleneck for event camera applications and motivates self-supervised learning methods tailored to event data. The proposed voxel-based backbone is pre-trained without relying on paired RGB images, yet delivers significant performance gains across diverse tasks.
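To make the voxel-based representation concrete, here is a rough sketch of how a stream of events (x, y, t, polarity) might be accumulated into a spatio-temporal voxel grid. The bin count, resolution, and signed-accumulation scheme are assumptions for illustration, not the paper's exact voxelization procedure.

```python
import numpy as np

def voxelize_events(events, num_bins=5, height=128, width=128):
    """Accumulate events (x, y, t, polarity) into a (num_bins, H, W) voxel grid.

    Hypothetical sketch: timestamps are normalized and hard-binned, and
    polarities are accumulated as +1/-1. The paper's actual scheme (e.g.,
    bilinear temporal interpolation) may differ.
    """
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    t = events[:, 2]
    p = events[:, 3]
    # Normalize timestamps to [0, 1] and assign each event to a temporal bin.
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)
    bins = np.clip((t_norm * num_bins).astype(int), 0, num_bins - 1)
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    # Signed accumulation: positive events add +1, negative events add -1.
    np.add.at(grid, (bins, y, x), np.where(p > 0, 1.0, -1.0))
    return grid
```

The resulting dense grid gives a sparse event stream a regular structure that a voxel backbone can process, analogous to how image patches feed a vision transformer.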
The study includes an overview of the proposed pre-training framework consisting of voxelization, grouping, masking, encoder architecture, and disentangled reconstruction branches. It explores the effectiveness of each reconstruction branch and demonstrates the advantages of disentangled masked modeling in terms of training efficiency and data efficiency. Ablation studies confirm that semantic-uniform masking enhances both local and global reconstruction branches.
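The idea behind semantic-uniform masking can be sketched as masking the same fraction of tokens within every group, so that no semantic region is over- or under-represented among the masked tokens. This is a minimal illustration under assumed grouping and mask ratio, not the paper's exact algorithm.

```python
import numpy as np

def semantic_uniform_mask(group_ids, mask_ratio=0.6, rng=None):
    """Mask a uniform fraction of tokens within each group.

    Hypothetical sketch: instead of sampling masked tokens globally at
    random, each group contributes the same masked fraction, keeping the
    masking uniform across semantic regions.
    """
    rng = np.random.default_rng(rng)
    mask = np.zeros(len(group_ids), dtype=bool)
    for g in np.unique(group_ids):
        idx = np.flatnonzero(group_ids == g)          # tokens in this group
        n_mask = int(round(len(idx) * mask_ratio))    # per-group quota
        mask[rng.choice(idx, size=n_mask, replace=False)] = True
    return mask
```

Compared with purely random masking, this per-group quota prevents a dense region from absorbing most of the mask budget, which matters for sparse, non-uniform event data.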
Experimental results show that the proposed method outperforms existing state-of-the-art models on object recognition, detection, semantic segmentation, and action recognition tasks. The approach also proves highly practical, achieving this superior performance with very few parameters and low computational requirements.
Key insights distilled from: Zhenpeng Hua..., arxiv.org, 03-04-2024. https://arxiv.org/pdf/2403.00416.pdf