Key Concepts
Masked Token Modeling (MTM) improves the storage efficiency of token-based vision model training by enabling self-supervised pre-training on tokens, while TokenAdapt and ColorAdapt make token-based data augmentation more effective.
Summary
The paper introduces a storage-efficient vision training framework called SeiT++ that combines Masked Token Modeling (MTM) with novel token augmentation strategies, TokenAdapt and ColorAdapt.
Key highlights:
- Recent advancements in deep neural networks have significantly improved performance across computer vision tasks, but achieving highly generalizable and high-performing vision models requires extensive datasets with large storage requirements.
- SeiT, a previous storage-efficient vision training framework, demonstrated the potential of token-based training, but lacked exploration beyond fully supervised learning scenarios.
- SeiT++ extends SeiT by integrating Masked Token Modeling (MTM) for self-supervised pre-training, which can take advantage of large-scale data beyond human-annotated datasets.
- TokenAdapt and ColorAdapt are proposed as novel token augmentation strategies to address the challenges of applying pixel-based data augmentations directly to tokens.
- Experiments on various scenarios, including storage-efficient ImageNet-1k classification, fine-grained classification, ADE-20k semantic segmentation, and robustness benchmarks, demonstrate the consistent performance improvement of SeiT++ over the baseline.
- The proposed methods also show extensibility to alternative input formats, such as DCT coefficients, highlighting their broad applicability.
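The core of MTM is straightforward: randomly replace a fraction of an image's visual tokens with a mask id and train the model to predict the original token ids at those positions. The snippet below is a minimal illustrative sketch of that masking step, not the paper's implementation; the `MASK_ID` value and the token ids are invented for the example.

```python
import random

MASK_ID = 0  # hypothetical id reserved for the [MASK] token (assumption)

def mask_tokens(tokens, mask_ratio=0.5, seed=0):
    """Randomly replace a fraction of visual tokens with MASK_ID.

    Returns the corrupted sequence and the masked positions; an MTM
    model would be trained to predict the original token ids there.
    """
    rng = random.Random(seed)
    n_mask = int(len(tokens) * mask_ratio)
    masked_idx = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    for i in masked_idx:
        corrupted[i] = MASK_ID
    return corrupted, masked_idx

tokens = [17, 4, 93, 42, 8, 61, 25, 77]
corrupted, idx = mask_tokens(tokens, mask_ratio=0.5)
# half of the positions now hold MASK_ID; the training targets are
# tokens[i] for i in idx
```

Because the inputs are discrete token ids rather than pixels, this corruption-and-reconstruction objective needs no labels, which is what lets SeiT++ exploit data beyond human-annotated datasets.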
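The difficulty TokenAdapt and ColorAdapt address is that pixel-space augmentations (crops, color jitter, etc.) are not directly meaningful on discrete token ids. A common workaround, sketched below as a toy example, is to map tokens to their codebook embeddings, perturb in that continuous space, and re-quantize to the nearest codebook entry. The 1-D codebook and the brightness-style shift here are invented for illustration and are not the paper's actual TokenAdapt or ColorAdapt procedure.

```python
# Hypothetical 1-D codebook; real VQ codebooks hold learned vectors.
CODEBOOK = [0.0, 0.25, 0.5, 0.75, 1.0]

def embed(token_ids):
    """Map discrete token ids to their codebook embeddings."""
    return [CODEBOOK[t] for t in token_ids]

def requantize(values):
    """Snap perturbed embeddings back to the nearest codebook entry."""
    return [min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - v))
            for v in values]

def augment_tokens(token_ids, shift=0.2):
    """Token-space analogue of a pixel-space brightness shift:
    embed -> perturb in embedding space -> re-quantize to token ids."""
    perturbed = [v + shift for v in embed(token_ids)]
    return requantize(perturbed)

print(augment_tokens([0, 2, 4], shift=0.2))  # prints [1, 3, 4]
```

The re-quantization step keeps the augmented data in the same discrete token format the model consumes, so augmented samples cost no extra storage beyond the original token sequences.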
Statistics
The full ImageNet-1k dataset requires approximately 140GB for training with images.
SeiT++ achieves over 70% top-1 accuracy on ImageNet-1k with just 1GB of data.
SeiT++ outperforms the SeiT baseline by 1.5-6.7 percentage points in top-1 accuracy on ImageNet-1k across varying storage budgets.
SeiT++ improves top-1 accuracy on ImageNet-100 by 4.1-9.1 percentage points over SeiT, with the largest gains in data-scarce scenarios.
Quotes
"Recent advancements in Deep Neural Network (DNN) models have significantly improved performance across computer vision tasks. However, achieving highly generalizable and high-performing vision models requires expansive datasets, resulting in significant storage requirements."
"To leverage the advantages offered by large-scale data beyond human-annotated datasets, it is imperative to explore label-free learning methods."
"By integrating TokenAdapt and ColorAdapt into our token augmentation process, we enhance the effectiveness of token-based training across both fully supervised and unsupervised learning scenarios, demonstrating significant improvements in model performance."