The core message of this work is that a clean, class-balanced subset can be effectively extracted from a noisy, long-tailed training dataset and used to train a robust classification model.
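As a rough illustration of the selection idea (not the paper's exact criterion), the sketch below keeps, for every class, the samples with the lowest loss under a warm-up model — the common small-loss heuristic — so the resulting subset is both approximately clean and class-balanced; the function name and arguments are hypothetical.

```python
import torch
from collections import defaultdict

def select_clean_balanced_subset(losses, labels, per_class):
    """Small-loss heuristic sketch: for each class, keep the `per_class` samples
    with the lowest loss under a warm-up model, yielding a subset that is
    approximately clean and class-balanced. Illustrative, not the paper's method."""
    indices_by_class = defaultdict(list)
    for idx, y in enumerate(labels.tolist()):
        indices_by_class[y].append(idx)
    keep = []
    for y, idxs in indices_by_class.items():
        idxs = sorted(idxs, key=lambda i: losses[i].item())  # low loss ~ likely clean
        keep.extend(idxs[:per_class])
    return torch.tensor(sorted(keep))
```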
The core message of this paper is that uninformative attention layers in vision transformers can be effectively integrated into their subsequent MLP layers, reducing computational load without compromising performance.
PEFTSmoothing leverages Parameter-Efficient Fine-Tuning (PEFT) methods to efficiently guide large-scale vision models like ViT to learn the noise-augmented data distribution, enabling the conversion of base models into certifiably robust classifiers.
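A minimal sketch of the training loop this implies, assuming a frozen torchvision ViT whose only trainable parameters are a small classification head standing in for a PEFT module; the noise level `sigma` and all names are illustrative rather than PEFTSmoothing's actual implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import vit_b_16

sigma = 0.25                        # Gaussian noise level (assumed; used at train and certification time)
model = vit_b_16(weights="IMAGENET1K_V1")
for p in model.parameters():        # freeze the large backbone
    p.requires_grad = False
model.heads = nn.Linear(768, 10)    # small trainable head stands in for a PEFT module
optimizer = torch.optim.AdamW(model.heads.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One step on Gaussian-noise-augmented inputs so the frozen ViT plus the
    small trainable module learns the smoothed data distribution."""
    noisy = images + sigma * torch.randn_like(images)
    loss = criterion(model(noisy), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```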
The proposed Graph-based Vision Transformer (GvT) utilizes graph convolutional projection and talking-heads attention to effectively train on small datasets, outperforming convolutional neural networks and other vision transformer variants.
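For readers unfamiliar with the attention variant, the module below sketches talking-heads attention (learned mixing across heads before and after the softmax); the graph convolutional projection is omitted, and the class is illustrative rather than the GvT implementation.

```python
import torch
import torch.nn as nn

class TalkingHeadsAttention(nn.Module):
    """Multi-head self-attention with learned mixing across the head axis
    before and after the softmax. Sketch of the attention variant only."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # head-mixing projections applied over the num_heads dimension
        self.pre_softmax = nn.Linear(num_heads, num_heads, bias=False)
        self.post_softmax = nn.Linear(num_heads, num_heads, bias=False)

    def forward(self, x):                                   # x: (B, N, dim)
        B, N, _ = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                # each: (B, H, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5   # (B, H, N, N)
        # mix information across heads before and after the softmax
        attn = self.pre_softmax(attn.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        attn = attn.softmax(dim=-1)
        attn = self.post_softmax(attn.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.proj(out)
```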
The authors propose a novel framework, Latent-based Diffusion Model for Long-tailed Recognition (LDMLR), that leverages the powerful generative capabilities of diffusion models to augment feature representations and address the challenge of long-tailed recognition in computer vision.
Deep neural networks used for image classification are vulnerable to adversarial attacks, which involve subtle manipulations of input data to cause misclassification. This study investigates the impact of FGSM and Carlini-Wagner attacks on three pre-trained CNN models, and examines the effectiveness of defensive distillation as a countermeasure.
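FGSM itself is compact enough to sketch: the function below shows the single-step sign-gradient perturbation, with an illustrative epsilon (the Carlini-Wagner attack and defensive distillation are not shown).

```python
import torch

def fgsm_attack(model, images, labels, epsilon=8 / 255):
    """Fast Gradient Sign Method: move each pixel one epsilon-sized step in
    the direction that increases the classification loss."""
    images = images.clone().detach().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(images), labels)
    loss.backward()
    adv = images + epsilon * images.grad.sign()
    return adv.clamp(0, 1).detach()
```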
The authors propose a novel framework for building Concept Bottleneck Models (CBMs) from pre-trained multi-modal encoders like CLIP. Their approach leverages Gumbel tricks and contrastive learning to create sparse and interpretable inner representations in the CBM, leading to significant improvements in accuracy compared to prior CBM methods.
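A plausible minimal sketch of such a bottleneck layer, assuming frozen CLIP image features as input: each concept is gated by a hard Gumbel-softmax sample, giving a sparse, discrete inner representation that a linear layer then classifies. The contrastive training objective is omitted, and the module is hypothetical rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelConceptBottleneck(nn.Module):
    """Sketch: frozen CLIP image features -> discrete on/off concepts sampled
    with straight-through Gumbel-softmax -> linear classifier over concepts."""
    def __init__(self, feat_dim=512, num_concepts=128, num_classes=10, tau=1.0):
        super().__init__()
        self.concept_logits = nn.Linear(feat_dim, num_concepts * 2)
        self.classifier = nn.Linear(num_concepts, num_classes)
        self.tau = tau

    def forward(self, clip_features):                       # (B, feat_dim)
        logits = self.concept_logits(clip_features)         # (B, num_concepts * 2)
        logits = logits.view(-1, logits.shape[-1] // 2, 2)  # (B, C, 2): on/off per concept
        # hard samples with usable gradients via the straight-through estimator
        concepts = F.gumbel_softmax(logits, tau=self.tau, hard=True)[..., 0]
        return self.classifier(concepts), concepts
```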
Using synthetic images of future classes generated by pre-trained text-to-image diffusion models can significantly improve the performance of exemplar-free class incremental learning methods relying on a frozen feature extractor.
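One way this could look in code, using the Hugging Face diffusers pipeline to synthesize images for classes that have not yet arrived; the model id, prompts, and class names are placeholders, not the paper's setup.

```python
import torch
from diffusers import StableDiffusionPipeline

# Generate placeholder images for not-yet-seen classes; these would later be
# passed through the frozen feature extractor alongside real data.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

future_classes = ["zebra", "canoe", "lighthouse"]          # hypothetical class names
synthetic_images = {
    name: pipe(f"a photo of a {name}", num_inference_steps=30).images[0]
    for name in future_classes
}
```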
Large Language Models (LLMs) can provide valuable visual descriptions and knowledge to enhance the performance of pre-trained vision-language models like CLIP in low-shot image classification tasks.
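A small sketch of the usual description-ensembling recipe this enables, using OpenAI's CLIP package: each class embedding is the average of several LLM-generated descriptions, and images are classified by cosine similarity. The descriptions below are made up for illustration and would normally come from prompting an LLM.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical LLM-generated descriptions per class.
descriptions = {
    "hen": ["a photo of a hen", "a bird with a small red comb and brown feathers"],
    "goldfish": ["a photo of a goldfish", "a small orange fish with shiny scales"],
}

with torch.no_grad():
    class_embeds = []
    for texts in descriptions.values():
        emb = model.encode_text(clip.tokenize(texts).to(device))
        emb = emb / emb.norm(dim=-1, keepdim=True)
        class_embeds.append(emb.mean(dim=0))      # average the descriptions per class
    class_embeds = torch.stack(class_embeds)
    class_embeds = class_embeds / class_embeds.norm(dim=-1, keepdim=True)

def classify(image_pil):
    """Zero-shot prediction against the description-averaged class embeddings."""
    with torch.no_grad():
        feat = model.encode_image(preprocess(image_pil).unsqueeze(0).to(device))
        feat = feat / feat.norm(dim=-1, keepdim=True)
        return (feat @ class_embeds.T).argmax(dim=-1).item()
```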
The core message of this paper is that state-of-the-art image classification models carry inherent geographical biases, and that analyzing and mitigating these biases makes the models more robust and fair across geographical regions and income levels.