Basic Concepts
An intra-task mutual attention method is proposed to enhance the feature representations of support and query sets in few-shot learning, enabling the model to effectively leverage both global and local information.
Summary
The paper presents a novel few-shot learning framework called IMAformer that utilizes an intra-task mutual attention mechanism to improve the feature representations for support and query sets.
Key highlights:
- The input images are divided into patches and encoded using a pre-trained Vision Transformer (ViT) architecture. This allows the model to capture both global (CLS token) and local (patch tokens) information.
- The intra-task mutual attention method is introduced, where the patch tokens are swapped between the support and query sets. This enables the CLS token of the support set to focus on the detailed features of the query set, and vice versa.
- This process strengthens the intra-class representations and promotes closer proximity between instances of the same class, leading to more discriminative features.
- The proposed method is simple, efficient, and effective: it builds on self-supervised pre-trained ViT models and requires fine-tuning only a few parameters.
- Extensive experiments on five popular few-shot learning benchmarks demonstrate the superior performance of IMAformer compared to state-of-the-art methods.
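The token swap described in the highlights above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: it uses a single attention head with identity query/key/value projections, treats row 0 of each token matrix as the CLS token, and the function names `cls_attention` and `mutual_attention` are hypothetical. In the actual model the tokens would come from a pre-trained ViT and pass through its learned attention layers.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def cls_attention(cls_token, patch_tokens):
    """Scaled dot-product attention with the CLS token as the only query.

    cls_token: shape (d,); patch_tokens: shape (n, d).
    Keys and values are the CLS token followed by the patch tokens.
    Assumption: identity projections instead of learned W_q, W_k, W_v.
    """
    tokens = np.vstack([cls_token[None, :], patch_tokens])      # (n + 1, d)
    scores = tokens @ cls_token / np.sqrt(cls_token.shape[0])   # (n + 1,)
    weights = softmax(scores)                                    # sums to 1
    return weights @ tokens                                      # (d,)


def mutual_attention(support_tokens, query_tokens):
    """Swap patch tokens between support and query before CLS attention.

    Each input has shape (n + 1, d) with the CLS token in row 0.
    The support CLS attends to the query patches, and vice versa.
    """
    s_cls, s_patches = support_tokens[0], support_tokens[1:]
    q_cls, q_patches = query_tokens[0], query_tokens[1:]
    support_out = cls_attention(s_cls, q_patches)  # support CLS sees query patches
    query_out = cls_attention(q_cls, s_patches)    # query CLS sees support patches
    return support_out, query_out


# Toy usage with random tokens standing in for ViT outputs.
rng = np.random.default_rng(0)
support = rng.normal(size=(5, 8))  # 1 CLS token + 4 patch tokens, d = 8
query = rng.normal(size=(5, 8))
support_feat, query_feat = mutual_attention(support, query)
```

Because each CLS token aggregates the other image's patches, the resulting features for a support/query pair depend on both images, which is the mechanism the summary credits for pulling same-class instances closer together.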
Statistics
"Humans possess remarkable ability to accurately classify new, unseen images after being exposed to only a few examples."
"For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge."
Quotes
"Diverging from previous approaches, our method leverages a fusion of both global and local features from support and query sets, without resorting to external modules beyond the backbone architecture."
"Our method employs a Transformer architecture, incorporates the concept of intra-task mutual attention, utilizes the meta-learning method to fine-tune specific parameters based on the self-supervised learning pre-trained model training with Masked Image Modelling as pretext task, and leads to boosted results in few-shot learning tasks."