Vision Transformers (ViTs) are widely used in computer vision tasks. Recent works focus on token reduction methods to optimize ViTs without changing their architecture. Multi-Criteria Token Fusion (MCTF) introduces a novel approach that considers similarity, informativeness, and token size to minimize information loss during fusion. By incorporating one-step-ahead attention and token reduction consistency, MCTF achieves the best speed-accuracy trade-off in various ViTs. Experimental results show significant improvements in accuracy (+0.5% to +0.3%) with reduced FLOPs by about 44%. MCTF outperforms previous reduction methods without training, demonstrating its efficiency and applicability across different Vision Transformers.
他の言語に翻訳
原文コンテンツから
arxiv.org
深掘り質問