Key Concepts
Multi-Criteria Token Fusion (MCTF) optimizes vision transformers by fusing tokens according to multiple criteria rather than similarity alone, improving both accuracy and efficiency.
Summary
Vision Transformers (ViTs) are widely used in computer vision tasks. Recent work focuses on token reduction methods that speed up ViTs without changing their architecture. Multi-Criteria Token Fusion (MCTF) considers three criteria when choosing which tokens to fuse (similarity, informativeness, and token size) in order to minimize the information lost during fusion. By additionally incorporating one-step-ahead attention and token reduction consistency, MCTF achieves the best speed-accuracy trade-off in various ViTs. In experiments on DeiT-T and DeiT-S, MCTF reduces FLOPs by about 44% while improving accuracy by +0.5% and +0.3%, respectively. MCTF also outperforms previous token reduction methods when applied without training, demonstrating its efficiency and applicability across different Vision Transformers.
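To make the idea of fusing on several criteria at once more concrete, here is a minimal pure-Python sketch. The summary only names the three criteria, so everything else here is an assumption made for illustration and may differ from the authors' method: the scoring formula (a product of a rescaled cosine similarity, an inverse-informativeness term, and an inverse-size term), the size-weighted averaging, the greedy single-pair fusion, and the function names `attraction` and `fuse_best_pair`. The informativeness scores are assumed to be given, for example as the attention each token receives.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def attraction(tokens, info, sizes, i, j):
    """Hypothetical multi-criteria score: high when tokens i and j are
    similar, uninformative, and small (assumed form, not from the source)."""
    similarity = 0.5 * (cosine(tokens[i], tokens[j]) + 1.0)  # rescaled to [0, 1]
    uninformative = 1.0 / (info[i] * info[j])
    small = 1.0 / (sizes[i] * sizes[j])
    return similarity * uninformative * small

def fuse_best_pair(tokens, info, sizes):
    """Fuse the single pair with the highest attraction score.
    The fused token is a size-weighted average and is appended last."""
    n = len(tokens)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    i, j = max(pairs, key=lambda p: attraction(tokens, info, sizes, p[0], p[1]))
    total = sizes[i] + sizes[j]
    fused = [(sizes[i] * a + sizes[j] * b) / total
             for a, b in zip(tokens[i], tokens[j])]
    fused_info = (sizes[i] * info[i] + sizes[j] * info[j]) / total
    keep = [k for k in range(n) if k not in (i, j)]
    return ([tokens[k] for k in keep] + [fused],
            [info[k] for k in keep] + [fused_info],
            [sizes[k] for k in keep] + [total])

# Toy input: tokens 0 and 3 are the most similar pair, but token 3 is
# highly informative, so the sketch fuses tokens 0 and 1 instead.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.05]]
info = [0.1, 0.1, 0.2, 0.6]
sizes = [1, 1, 1, 1]

new_tokens, new_info, new_sizes = fuse_best_pair(tokens, info, sizes)
```

In this toy input a similarity-only rule would pick tokens 0 and 3, whereas the combined score prefers tokens 0 and 1 because both carry little information. Tracking `sizes` lets later fusion steps penalize tokens that already represent many original patches.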
Statistics
DeiT-T and DeiT-S with MCTF reduce FLOPs by about 44%
DeiT-T with MCTF achieves a performance improvement of +0.5%
DeiT-S with MCTF improves performance by +0.3%
Quotes
"MCTF achieves the best speed-accuracy trade-off in diverse ViTs."
"Our contributions are summarized in fourfold."