Core Concepts
Optimizing vision transformers with a novel spectral convolutional approach that compares real and complex spectral operators.
Abstract
The paper introduces a Spectral Convolutional Transformer (SCT) that combines local, global, and long-range dependencies in vision transformers. By leveraging the simplicity of the real-valued Hartley transform together with convolutional operators, SCT achieves state-of-the-art performance on ImageNet and various downstream tasks. Extensive experiments demonstrate that SCT captures both local and global relationships in images while reducing computational complexity.
Directory:
Introduction
Evolution of vision-based transformers like ViT, DeiT, Swin, and BEiT.
Need for including local information in transformers.
Real vs. Complex Spectral Vision Transformers
Challenges faced by self-attention networks in capturing local relationships.
Introduction of a novel transformer architecture leveraging the Hartley transform.
Method
Description of the SCT architecture with Spectral Convolutional and Attention Modules.
Experiment
Comprehensive evaluation of SCT on image recognition and instance segmentation tasks.
Performance comparison with other vision backbones on ImageNet.
Ablation Studies
Comparison of different spectral transforms and their impact on performance.
Transfer Learning Studies
Performance comparison of SCT-C on the CIFAR-10, CIFAR-100, Oxford Flowers, and Stanford Cars datasets.
Task Learning Studies
Comparison of SCT-C with other transformers on instance segmentation tasks using the MS COCO dataset.
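The Hartley-based spectral mixing mentioned in the Method entry can be sketched briefly. The snippet below is a minimal illustration, not the paper's implementation: it assumes an elementwise learnable filter in the Hartley domain, and uses the facts that the discrete Hartley transform (DHT) equals Re(FFT) − Im(FFT) for real input and is its own inverse up to a 1/N scale.

```python
import numpy as np

def hartley(x):
    # Discrete Hartley transform via the FFT: H = Re(F) - Im(F).
    # Real input -> real spectrum, avoiding complex arithmetic downstream.
    f = np.fft.fft(x, axis=-1)
    return f.real - f.imag

def spectral_mix(tokens, weights):
    # Illustrative spectral token mixing: transform, filter, inverse.
    # `weights` is a hypothetical learnable elementwise filter.
    h = hartley(tokens) * weights
    # The DHT is involutory: applying it twice and dividing by N inverts it.
    return hartley(h) / tokens.shape[-1]

tokens = np.random.default_rng(0).standard_normal((4, 8))  # (seq, dim)
mixed = spectral_mix(tokens, np.ones(8))  # identity filter recovers the input
```

With an all-ones filter the round trip is the identity, which is a quick sanity check that the transform pair is consistent; in a real model the filter would be learned per frequency.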
Stats
SCT-C-Small gives 84.5% top-1 accuracy on ImageNet.
SCT-C-Large reaches 85.9% and SCT-C-Huge reaches 86.4%.
Quotes
"We advocate combining three diverse views of data - local, global, and long-range dependence."
"Through extensive experiments, we show that SCT-C-small gives state-of-the-art performance on the ImageNet dataset."