Core Concepts
Despite claims that newer models are more efficient, the original Vision Transformer (ViT) remains Pareto optimal across multiple efficiency metrics, including accuracy, speed, and memory usage. Hybrid attention-CNN models exhibit remarkable inference memory- and parameter-efficiency, and scaling up the model is generally more efficient than increasing input image resolution.
Abstract
The authors conduct a comprehensive benchmark of over 150 experiments on more than 35 efficient vision transformer models to evaluate their efficiency across various metrics, including accuracy, speed, and memory usage. They find that despite claims of other models being more efficient, the original ViT remains Pareto optimal across multiple metrics.
The key insights from the study are:
Hybrid attention-CNN models, such as EfficientFormerV2-S0 and CoaT-Ti, exhibit remarkable inference memory- and parameter-efficiency, outperforming other attention-based models as well as ResNet50.
Using a larger model is generally more efficient than using higher-resolution images. Fine-tuning at a higher resolution (384px) improves accuracy but incurs a significant increase in computational cost, substantially reducing throughput.
The authors observe that ViT remains Pareto optimal for three of the four efficiency metrics: training speed, inference speed, and training memory. Other efficiency strategies, such as token sequence reduction methods, become viable alternatives when speed and training efficiency are the priority.
For scenarios with significant inference memory constraints, the authors recommend considering hybrid attention models.
The authors provide a centralized resource for practitioners and researchers, facilitating informed decisions when selecting transformers or measuring progress in the development of efficient transformers.
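The Pareto-optimality claim above can be made concrete: a model is on the Pareto front if no other model is at least as good on every metric and strictly better on at least one. A minimal sketch, assuming two metrics (accuracy and throughput, both higher-is-better) and hypothetical model names and numbers not taken from the paper:

```python
def pareto_front(models):
    """Return the names of models not dominated on (accuracy, throughput).

    A model is dominated if some other model is at least as good on
    both metrics and strictly better on at least one.
    """
    front = []
    for name, acc, tput in models:
        dominated = any(
            a >= acc and t >= tput and (a > acc or t > tput)
            for n, a, t in models
            if n != name
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical (name, accuracy %, images/sec) triples for illustration only.
models = [
    ("ViT-S",   81.4, 1000.0),
    ("Model-A", 80.0,  900.0),  # dominated by ViT-S on both metrics
    ("Model-B", 82.0,  700.0),  # trades throughput for accuracy
]
print(pareto_front(models))  # → ['ViT-S', 'Model-B']
```

The same dominance check extends directly to more metrics (e.g. training memory, parameter count) by comparing longer tuples.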
Stats
The authors report the following key metrics:
ImageNet-1k validation accuracy
Throughput in images per second
Training and inference memory requirements in GB of VRAM
Number of parameters
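Of these metrics, throughput is the one most sensitive to measurement methodology. A framework-agnostic sketch of how images-per-second is typically measured (the function and parameter names here are illustrative, not from the paper; a real GPU benchmark would additionally need device synchronization, e.g. `torch.cuda.synchronize()` in PyTorch, before reading the clock):

```python
import time

def measure_throughput(model_fn, batch, n_iters=50, warmup=5):
    """Estimate images/sec for a callable that processes one batch.

    Warm-up iterations are excluded from timing so one-time costs
    (memory allocation, kernel compilation, caches) don't skew the result.
    """
    for _ in range(warmup):
        model_fn(batch)
    start = time.perf_counter()
    for _ in range(n_iters):
        model_fn(batch)
    elapsed = time.perf_counter() - start
    return n_iters * len(batch) / elapsed

# Usage with a stand-in "model" (any per-batch callable works):
dummy_model = lambda b: [x * 2 for x in b]
ips = measure_throughput(dummy_model, list(range(32)), n_iters=10, warmup=2)
print(f"{ips:.0f} images/sec")
```

Batch size strongly affects the result, so throughput comparisons are only meaningful at a fixed batch size and resolution.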
Quotes
"Despite claims of other models being more efficient, ViT remains Pareto optimal across multiple metrics."
"Hybrid attention-CNN models exhibit remarkable inference memory- and parameter-efficiency."
"Using a larger model is generally more efficient than using higher resolution images."