Core Concepts
The authors propose a novel framework, Latent-based Diffusion Model for Long-tailed Recognition (LDMLR), that leverages the powerful generative capabilities of diffusion models to augment feature representations and address the challenge of long-tailed recognition in computer vision.
Summary
The paper addresses the problem of long-tailed recognition in computer vision, where some classes have significantly more samples than others in the training data. The authors propose a three-stage framework called LDMLR that utilizes diffusion models to generate pseudo-features and augment the training data.
In the first stage, the authors train a baseline neural network model on the long-tailed dataset and extract the encoded features. In the second stage, they train a class-conditional latent diffusion model (LDM) to generate pseudo-features for different classes. Finally, in the third stage, they fine-tune the classification head using both the encoded and pseudo-features.
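The data flow through the three stages can be sketched as follows. This is a minimal illustration, not the authors' implementation: the features are random stand-ins for encoder outputs, `sample_pseudo_features` is a hypothetical placeholder that draws from a per-class Gaussian instead of running a class-conditional latent diffusion model, and topping every class up to the head-class count is an assumed augmentation policy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 (stand-in): treat these arrays as features already extracted by a
# trained baseline encoder. Class 0 is a head class, class 2 a tail class.
counts = {0: 200, 1: 40, 2: 5}
dim = 8
centers = rng.normal(size=(len(counts), dim)) * 3.0
feats = np.concatenate(
    [centers[c] + rng.normal(size=(n, dim)) for c, n in counts.items()]
)
labels = np.concatenate([np.full(n, c) for c, n in counts.items()])


# Stage 2 (stand-in): class-conditional generator of pseudo-features.
# ASSUMPTION: a per-class diagonal Gaussian replaces the latent diffusion
# model here, only so that the sketch stays small and runnable.
def sample_pseudo_features(feats, labels, cls, n, rng):
    x = feats[labels == cls]
    mu = x.mean(axis=0)
    sigma = x.std(axis=0) + 1e-3
    return mu + sigma * rng.normal(size=(n, x.shape[1]))


# ASSUMPTION: generate enough pseudo-features to match the largest class.
target = max(counts.values())
pseudo_x, pseudo_y = [], []
for c, n in counts.items():
    k = target - n
    if k > 0:
        pseudo_x.append(sample_pseudo_features(feats, labels, c, k, rng))
        pseudo_y.append(np.full(k, c))
aug_x = np.concatenate([feats] + pseudo_x)
aug_y = np.concatenate([labels] + pseudo_y)


# Stage 3: fine-tune only a linear classification head (softmax regression)
# on the union of encoded features and pseudo-features.
def train_head(x, y, num_classes, lr=0.1, steps=300):
    w = np.zeros((x.shape[1], num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[y]
    for _ in range(steps):
        logits = x @ w + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad = (p - onehot) / len(x)  # gradient of mean cross-entropy
        w -= lr * (x.T @ grad)
        b -= lr * grad.sum(axis=0)
    return w, b


w, b = train_head(aug_x, aug_y, num_classes=len(counts))
```

Because only the head is retrained, the encoder from the first stage stays fixed and the generator never has to produce images, which is the point of working in feature space.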
The experiments on CIFAR-LT and ImageNet-LT datasets demonstrate that the proposed LDMLR framework can effectively improve the classification accuracy over various baseline methods, especially for the tail classes. The authors also conduct ablation studies to analyze the impact of different components, such as the augmentation ratio and the selection of classes for feature generation.
The key highlights of the paper are:
- Applying diffusion models for feature augmentation in long-tailed recognition, which the authors describe as a novel approach.
- Proposing to perform the augmentation in the latent space to reduce computational cost and training time.
- Improving classification accuracy over baseline methods on long-tailed datasets, with gains ranging from 0.2 to 1.62 percentage points over the WCDAS baseline.
- Providing insights into the effectiveness of targeted feature augmentation for different class distributions (many, medium, few).
Statistics
The paper reports the following key metrics:
- CIFAR-10-LT (imbalance factor IF = 100): WCDAS baseline accuracy 84.67%, WCDAS+LDMLR accuracy 86.29% (+1.62 percentage points)
- CIFAR-100-LT (imbalance factor IF = 100): WCDAS baseline accuracy 50.95%, WCDAS+LDMLR accuracy 51.92% (+0.97 percentage points)
- ImageNet-LT: WCDAS baseline accuracy 44.6%, WCDAS+LDMLR accuracy 44.8% (+0.2 percentage points)
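IF in the results above is the imbalance factor, the ratio between the sizes of the largest and smallest classes. To give a sense of what IF = 100 means for CIFAR-10, the sketch below computes per-class training counts under an exponential decay profile. The decay profile and the helper name `long_tailed_counts` are assumptions for illustration; this summary does not state how the paper constructs its splits.

```python
def long_tailed_counts(n_max, num_classes, imbalance_factor):
    """Per-class sample counts decaying exponentially from n_max.

    ASSUMPTION: exponential decay is one common way to build a long-tailed
    split; it is not confirmed here as the construction used in the paper.
    """
    counts = []
    for i in range(num_classes):
        frac = i / (num_classes - 1)  # 0 for the head class, 1 for the tail
        counts.append(round(n_max * (1.0 / imbalance_factor) ** frac))
    return counts


# CIFAR-10 has 5000 training images per class before subsampling.
cifar10_lt = long_tailed_counts(n_max=5000, num_classes=10, imbalance_factor=100)
```

Under this profile the head class keeps all 5000 images while the rarest class keeps 50, which is why accuracy on tail classes is the main concern.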
Quotes
"Our method applies the diffusion model to enrich the feature embeddings for the long-tailed problem, offering a new solution to this challenging problem. To the best of my knowledge, we are first to explore the capability of diffusion model in the long-tailed recognition problem."
"When using the diffusion model, we propose to do the augmentation in the latent space instead of the image space, which reduces the computational cost and speeds up the training process."