Core Concepts
The proposed Graph-based Vision Transformer (GvT) utilizes graph convolutional projection and talking-heads attention to effectively train on small datasets, outperforming convolutional neural networks and other vision transformer variants.
Summary
The paper introduces a new vision transformer architecture, the Graph-based Vision Transformer (GvT), that can be trained from scratch on small datasets and achieves state-of-the-art performance.
Key highlights:
- GvT treats the image as graph data and uses graph convolutional projection to learn dependencies among tokens, leveraging inductive bias to attend to local features in early layers.
- To address the low-rank bottleneck in attention heads, GvT employs talking-heads attention: sparse selection on the attention tensor removes redundant scores, and the filtered attention scores are then allowed to interact across heads.
- GvT also applies graph-pooling between intermediate blocks to reduce the number of tokens and aggregate semantic information more effectively.
- Extensive experiments on small vision datasets like ClipArt, CIFAR-100, Oxford-IIIT Pet, Sketch, Chest X-ray, and COVID-CT demonstrate that GvT outperforms convolutional neural networks and other vision transformer variants trained from scratch.
- Ablation studies validate the effectiveness of the proposed graph convolutional projection and talking-heads attention in GvT.
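The graph convolutional projection above treats patch tokens as graph nodes and aggregates each token's neighbours before projecting. A minimal NumPy sketch, assuming a 4-neighbour grid graph with self-loops and symmetric normalization (the specific graph construction and function names here are illustrative, not the paper's code):

```python
import numpy as np

def grid_adjacency(h, w):
    """4-neighbour adjacency (with self-loops) for an h x w token grid."""
    n = h * w
    A = np.eye(n)
    for r in range(h):
        for c in range(w):
            i = r * w + c
            if c + 1 < w:
                A[i, i + 1] = A[i + 1, i] = 1  # right neighbour
            if r + 1 < h:
                A[i, i + w] = A[i + w, i] = 1  # bottom neighbour
    return A

def graph_conv_projection(X, A, W):
    """One graph-convolution step: D^(-1/2) A D^(-1/2) X W."""
    D = A.sum(axis=1)
    A_hat = A / np.sqrt(np.outer(D, D))  # symmetric degree normalization
    return A_hat @ X @ W

rng = np.random.default_rng(0)
h, w, d = 4, 4, 8                      # 4x4 token grid, 8-dim tokens
X = rng.standard_normal((h * w, d))    # patch tokens as node features
A = grid_adjacency(h, w)
W = rng.standard_normal((d, d)) / np.sqrt(d)
Q = graph_conv_projection(X, A, W)
print(Q.shape)  # (16, 8)
```

Because each projected token mixes in its grid neighbours, the early attention layers see locally smoothed features, which is the inductive bias the bullet points describe.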
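The talking-heads step can likewise be sketched: keep only the top-k attention logits per query (sparse selection), then let the filtered scores interact across heads via a mixing matrix before the softmax. The ordering, shapes, and single mixing matrix `P` are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparse_talking_heads(Q, K, V, P, k=4):
    """Q, K, V: (heads, tokens, dim); P: (heads, heads) head-mixing matrix."""
    h, n, d = Q.shape
    logits = Q @ K.transpose(0, 2, 1) / np.sqrt(d)      # (h, n, n)
    # sparse selection: keep the top-k scores per query, mask the rest
    kth = np.sort(logits, axis=-1)[..., -k][..., None]  # k-th largest per row
    masked = np.where(logits >= kth, logits, -1e9)
    # talking heads: filtered scores interact across heads before softmax
    mixed = np.einsum('ij,jnm->inm', P, masked)
    return softmax(mixed) @ V

rng = np.random.default_rng(0)
h, n, d = 4, 16, 8
Q, K, V = (rng.standard_normal((h, n, d)) for _ in range(3))
P = np.eye(h) + 0.1 * rng.standard_normal((h, h))       # near-identity mixing
out = sparse_talking_heads(Q, K, V, P)
print(out.shape)  # (4, 16, 8)
```

Mixing across heads gives each head access to the others' (already filtered) attention patterns, which is the mechanism cited for escaping the low-rank bottleneck of independent heads.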
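Finally, the graph-pooling step between blocks reduces the token count while aggregating neighbouring semantics. As a simple stand-in (the paper's exact pooling operator is not reproduced here), averaging each 2x2 neighbourhood of the token grid halves both spatial dimensions:

```python
import numpy as np

def grid_graph_pool(X, h, w):
    """Pool (h*w, d) tokens to (h/2 * w/2, d) by averaging 2x2 neighbourhoods.
    A simplified stand-in for graph pooling on a grid-structured token graph."""
    d = X.shape[1]
    G = X.reshape(h, w, d)
    pooled = G.reshape(h // 2, 2, w // 2, 2, d).mean(axis=(1, 3))
    return pooled.reshape(-1, d)

X = np.arange(16 * 3, dtype=float).reshape(16, 3)  # 4x4 grid of 3-dim tokens
Y = grid_graph_pool(X, 4, 4)
print(Y.shape)  # (4, 3)
```

Each output token summarizes a 2x2 cluster of input tokens, so later blocks attend over fewer, more semantically aggregated nodes.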
Statistics
The ClipArt dataset contains 33,525 training images and 14,604 testing images, divided into 345 different object categories, with an average of 97 images per category.
The CIFAR-100 dataset contains 60,000 32x32 color images in 100 classes, with 500 training and 100 testing images per class.
The Oxford-IIIT Pet dataset contains around 100 training images for each of the 37 pet categories.
The Sketch-Subset dataset contains 1,997 training images and 866 testing images across 16 categories.
The Chest X-ray dataset contains 1,600 training images and 2,000 testing images, with 1,800 images in each of the normal and abnormal categories.
The COVID-CT dataset comprises 349 positive and 397 negative examples; the split used contains 280 training images and 186 testing images.
Quotes
"When training from scratch on small datasets, there is still a significant performance gap between ViTs and Convolutional Neural Networks (CNNs), which is attributed to the lack of inductive bias."
"To address this issue, we propose a Graph-based Vision Transformer (GvT) that utilizes graph convolutional projection and graph-pooling."
"To overcome the low-rank bottleneck in attention heads, we employ talking-heads technology based on bilinear pooled features and sparse selection of attention tensors."