Core Concepts
The authors propose the Eagle Vision Transformer (EViT), a backbone that combines the strengths of convolutions and vision transformers and is inspired by the bi-fovea physiological structure and visual properties of eagle eyes. EViT is competitive in both accuracy and computational cost on image classification, object detection, and semantic segmentation.
Abstract
The authors present the Eagle Vision Transformer (EViT), a novel backbone network for computer vision tasks that is inspired by the bi-fovea visual system of eagle eyes.
Key highlights:
- The authors propose a Bi-Fovea Self-Attention (BFSA) module that simulates the shallow and deep fovea of eagle vision, enabling the network to learn feature representations from coarse to fine.
- They introduce a Bi-Fovea Feedforward Network (BFFN) that mimics the hierarchical and parallel information processing of the biological visual cortex.
- The authors design a Bionic Eagle Vision (BEV) block that combines BFSA and BFFN, and use it to build a general pyramid backbone network family called EViTs.
- Experiments show that EViTs achieve highly competitive performance and computational efficiency compared to other state-of-the-art vision transformer models across tasks like image classification, object detection, and semantic segmentation.
The authors demonstrate the potential of combining eagle vision with vision transformers, and show that EViTs can bring significant performance improvements in computer vision.
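To make the coarse-to-fine idea in the highlights concrete, here is a minimal NumPy sketch of a block that chains a two-stage attention step with a feedforward step. It is an illustration under stated assumptions, not the authors' implementation: this summary does not specify the internals of BFSA or BFFN, so the pooled "coarse" pass, the full-resolution "fine" pass, the residual connections, the ReLU feedforward, and all function names (`coarse_to_fine_attention`, `feedforward`, `bev_like_block`) are hypothetical stand-ins.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores) @ keys_values


def coarse_to_fine_attention(tokens, pool=4):
    """Hypothetical stand-in for BFSA: a coarse pass followed by a fine pass."""
    n, d = tokens.shape  # n must be divisible by pool
    # Coarse stage (assumption): each token attends to a pooled summary.
    pooled = tokens.reshape(n // pool, pool, d).mean(axis=1)
    coarse = tokens + attention(tokens, pooled)
    # Fine stage (assumption): the result attends over all full-resolution tokens.
    return coarse + attention(coarse, coarse)


def feedforward(tokens, w1, w2):
    """Hypothetical stand-in for BFFN: a plain two-layer MLP with a residual."""
    hidden = np.maximum(tokens @ w1, 0.0)  # ReLU
    return tokens + hidden @ w2


def bev_like_block(tokens, w1, w2, pool=4):
    """Attention followed by feedforward, mirroring the BFSA + BFFN pairing."""
    return feedforward(coarse_to_fine_attention(tokens, pool), w1, w2)


rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))      # 16 tokens, 8 channels
w1 = rng.normal(size=(8, 32)) * 0.1    # expansion weights
w2 = rng.normal(size=(32, 8)) * 0.1    # projection weights
out = bev_like_block(tokens, w1, w2)
print(out.shape)
```

The block keeps the token count and channel width unchanged, so in a pyramid backbone several such blocks could be stacked within a stage, with downsampling applied between stages.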
Stats
EViT-Tiny achieves 79.9% top-1 accuracy on ImageNet-1K with 1.91 GFLOPs.
EViT-Base achieves 83.9% top-1 accuracy on ImageNet-1K with 6.35 GFLOPs.
EViT-Large achieves 84.9% top-1 accuracy on ImageNet-1K with 12.5 GFLOPs.
EViT-Small and EViT-Base outperform other backbones by at least 0.4% and 0.5% in object detection and instance segmentation on COCO 2017.
EViT-Small and EViT-Base achieve 46.1% and 48.5% mIoU, respectively, on ADE20K semantic segmentation, outperforming PVT by at least 0.9%.
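One way to read the ImageNet-1K figures listed above is to compute how much accuracy each step up in model size buys per additional GFLOP. The short script below does that arithmetic using only the three accuracy and GFLOPs pairs reported in this section; the helper name `marginal_gain` is illustrative.

```python
# (top-1 accuracy in %, GFLOPs) on ImageNet-1K, copied from the stats above.
results = {
    "EViT-Tiny": (79.9, 1.91),
    "EViT-Base": (83.9, 6.35),
    "EViT-Large": (84.9, 12.5),
}


def marginal_gain(smaller, larger):
    """Accuracy points gained per additional GFLOP between two variants."""
    acc_s, flops_s = results[smaller]
    acc_l, flops_l = results[larger]
    return (acc_l - acc_s) / (flops_l - flops_s)


tiny_to_base = marginal_gain("EViT-Tiny", "EViT-Base")
base_to_large = marginal_gain("EViT-Base", "EViT-Large")

print(f"Tiny -> Base:  {tiny_to_base:.2f} points per GFLOP")   # 0.90
print(f"Base -> Large: {base_to_large:.2f} points per GFLOP")  # 0.16
```

On these numbers, moving from Tiny to Base returns about 0.90 accuracy points per extra GFLOP, while moving from Base to Large returns about 0.16.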
Quotes
"Benefiting from biological eagle vision, we propose a novel Bi-Fovea Self-Attention (BFSA). It [is] used to simulate the shallow and deep fovea of eagle vision, prompting the network to learn the feature representation of targets from coarse to fine."
"Taking inspiration from neuroscience, we continue the bi-fovea structure design principle of eagle vision, introduce a Bi-Fovea Feedforward Network (BFFN), and design a Bionic Eagle Vision (BEV) block based on the BFSA and BFFN."
"Following the hierarchical design concept, we propose a general and efficient pyramid backbone network family called EViTs. In terms of computational efficiency and performance, EViTs show significant competitive advantages compared with other counterparts."