Vision Transformers (ViTs) are widely used in computer vision tasks. Recent work focuses on token reduction methods that speed up ViTs without changing their architecture. Multi-Criteria Token Fusion (MCTF) fuses tokens according to three criteria (similarity, informativeness, and the size of the fused tokens) to minimize information loss during fusion. By incorporating one-step-ahead attention to capture token informativeness and a training strategy called token reduction consistency, MCTF achieves the best speed-accuracy trade-off across various ViTs. Experimental results show that DeiT-T and DeiT-S with MCTF reduce FLOPs by about 44% while improving accuracy over the base models by +0.5% and +0.3%, respectively. MCTF also outperforms previous reduction methods both with and without training, demonstrating its efficiency and applicability across different Vision Transformers.
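To make the idea of fusing tokens under several criteria at once more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`attraction`, `fuse_best_pair`), the multiplicative way the three criteria are combined, the greedy one-pair-at-a-time fusion, and the size-weighted average are all assumptions chosen for illustration. Informativeness scores are taken as given positive numbers here, whereas the summary says MCTF derives them from one-step-ahead attention.

```python
import math
from itertools import combinations

EPS = 1e-6  # illustrative guard against division by zero (an assumption)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def attraction(x_i, x_j, info_i, info_j, size_i, size_j):
    """Hypothetical multi-criteria score; higher means a better fusion candidate.

    Assumed combination: similar tokens attract, while informative or
    already-large tokens are discouraged from being fused.
    """
    sim = 0.5 * (cosine(x_i, x_j) + 1.0)   # rescale cosine to [0, 1]
    info = 1.0 / (info_i * info_j + EPS)   # penalize informative tokens
    size = 1.0 / (size_i * size_j)         # penalize large fused tokens
    return sim * info * size


def fuse_best_pair(tokens, info, sizes):
    """Fuse the single highest-scoring pair; return (new_tokens, new_sizes)."""
    if len(tokens) < 2:
        return list(tokens), list(sizes)
    i, j = max(
        combinations(range(len(tokens)), 2),
        key=lambda p: attraction(tokens[p[0]], tokens[p[1]],
                                 info[p[0]], info[p[1]],
                                 sizes[p[0]], sizes[p[1]]),
    )
    total = sizes[i] + sizes[j]
    # Size-weighted average, so tokens that already represent more patches count more.
    merged = [(sizes[i] * a + sizes[j] * b) / total
              for a, b in zip(tokens[i], tokens[j])]
    keep = [k for k in range(len(tokens)) if k not in (i, j)]
    return [tokens[k] for k in keep] + [merged], [sizes[k] for k in keep] + [total]


# Toy example: tokens 0 and 1 are near-duplicates with low informativeness.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
info = [0.2, 0.2, 0.9]
sizes = [1, 1, 1]
new_tokens, new_sizes = fuse_best_pair(tokens, info, sizes)
```

In this toy input the two similar, low-informativeness tokens are fused into one token of size 2, and the distinct token is left untouched. If the informativeness scores are changed so that the near-duplicates become highly informative, the sketch prefers a different pair, which is the kind of behaviour a similarity-only criterion cannot express.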
Key insights derived from work
by Sanghyeok Le... at arxiv.org, 03-18-2024
https://arxiv.org/pdf/2403.10030.pdf