Vision Transformers (ViTs) are widely used in computer vision tasks, and recent work reduces their cost through token reduction methods that require no architectural changes. Multi-Criteria Token Fusion (MCTF) gradually fuses tokens based on three criteria (similarity, informativeness, and token size) to minimize the information lost during fusion. Combined with one-step-ahead attention and token reduction consistency, MCTF achieves the best speed-accuracy trade-off across various ViTs. Experiments show accuracy gains of +0.5% and +0.3% while cutting FLOPs by about 44%. MCTF also outperforms previous reduction methods even without training, demonstrating its efficiency and broad applicability across different Vision Transformers.
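The fusion idea above can be sketched in a toy form. This is a minimal illustration, not the authors' exact method: it assumes informativeness scores and token sizes are given externally, greedily merges the pair of tokens with the highest combined score (high similarity, low informativeness, small size, with hypothetical weights 0.5 and 0.1), and size-weights the merged token so repeated fusions stay consistent.

```python
import numpy as np

def fuse_tokens(tokens, info, sizes, r=1):
    """Toy multi-criteria token fusion (a sketch, not MCTF itself).

    tokens: (n, d) array of token embeddings
    info:   (n,) informativeness score per token (assumed given)
    sizes:  (n,) number of original tokens each entry represents
    r:      number of fusion steps (each step removes one token)
    """
    tokens, info, sizes = tokens.copy(), info.copy(), sizes.copy()
    for _ in range(r):
        n = len(tokens)
        # Cosine similarity between every pair of tokens.
        normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
        sim = normed @ normed.T
        best, pair = -np.inf, None
        for i in range(n):
            for j in range(i + 1, n):
                # Prefer fusing similar, uninformative, small token pairs.
                score = (sim[i, j]
                         - 0.5 * (info[i] + info[j])   # hypothetical weight
                         - 0.1 * (sizes[i] + sizes[j]))  # hypothetical weight
                if score > best:
                    best, pair = score, (i, j)
        i, j = pair
        w = sizes[i] + sizes[j]
        # Size-weighted average keeps repeated fusions consistent.
        merged = (sizes[i] * tokens[i] + sizes[j] * tokens[j]) / w
        tokens = np.vstack([np.delete(tokens, [i, j], axis=0), merged])
        info = np.append(np.delete(info, [i, j]), max(info[i], info[j]))
        sizes = np.append(np.delete(sizes, [i, j]), w)
    return tokens, info, sizes
```

Each step removes exactly one token, so fusing `r` times on `n` tokens leaves `n - r`, and the `sizes` entries always sum to the original token count.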
Key ideas extracted from arxiv.org
by Sanghyeok Le... at arxiv.org, 03-18-2024
https://arxiv.org/pdf/2403.10030.pdf