The authors present Efficient Modulation (EfficientMod), a novel design for efficient vision networks. They revisit the modulation mechanism, which processes the input through a convolutional context-modeling branch and a feature-projection layer, fuses the two branches via element-wise multiplication, and follows with an MLP block.
The authors demonstrate that the modulation mechanism is well-suited for efficient networks and propose the EfficientMod block as the essential building block for their networks. EfficientMod benefits from the representational ability of the modulation mechanism and the authors' efficient design.
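The modulation structure described above can be sketched in a few lines. The following is a minimal, simplified NumPy illustration (not the authors' implementation): a crude neighbor-averaging filter stands in for the convolutional context-modeling branch, and the function and parameter names (`modulation_block`, `w_v`, `w_c`, `w_o`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(x, w, b):
    # Pointwise (1x1) projection over the channel dimension.
    return x @ w + b

def modulation_block(x, params):
    """Simplified sketch of a modulation block:
    out = x + proj(context(x) * value(x)).
    The real EfficientMod block differs in its exact design."""
    v = linear(x, params["w_v"], params["b_v"])        # value projection branch
    ctx = linear(x, params["w_c"], params["b_c"])      # context branch projection
    # Stand-in for a depthwise convolution: average each position
    # with its four spatial neighbors (edge-padded).
    pad = np.pad(ctx, ((1, 1), (1, 1), (0, 0)), mode="edge")
    ctx = (pad[1:-1, 1:-1] + pad[:-2, 1:-1] + pad[2:, 1:-1]
           + pad[1:-1, :-2] + pad[1:-1, 2:]) / 5.0
    fused = ctx * v                                    # element-wise modulation
    out = linear(fused, params["w_o"], params["b_o"])  # output projection
    return x + out                                     # residual connection

C = 8
params = {k: rng.standard_normal((C, C)) * 0.1 for k in ("w_v", "w_c", "w_o")}
params.update({b: np.zeros(C) for b in ("b_v", "b_c", "b_o")})

x = rng.standard_normal((7, 7, C))   # H x W x C feature map
y = modulation_block(x, params)
print(y.shape)                       # (7, 7, 8)
```

The element-wise product lets the context branch rescale the value features per position and per channel, which is the attention-like behavior the modulation mechanism provides without the quadratic cost of self-attention.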
The authors show that their EfficientMod-based networks can achieve better trade-offs between accuracy and efficiency compared to previous state-of-the-art efficient networks. When integrating EfficientMod with the vanilla self-attention block, the authors obtain a hybrid architecture that further improves performance without loss of efficiency.
Extensive experiments verify the performance of EfficientMod. EfficientMod-s outperforms EfficientFormerV2-s2 by 0.6% top-1 accuracy while running 25% faster on GPU. EfficientMod also substantially outperforms EfficientFormerV2 on downstream tasks such as semantic segmentation, surpassing it by 3.6 mIoU on the ADE20K benchmark.
Source: Xu Ma, Xiyang... at arxiv.org, 04-01-2024
https://arxiv.org/pdf/2403.19963.pdf