The paper proposes a novel framework that selectively removes non-essential attention layers from vision transformers and integrates their knowledge into the subsequent MLP layers. The key insights are:
The entropy of the attention layers in the bottom (early) blocks is lower than that of the top blocks, and their accompanying MLP layers exhibit similar entropy levels. This suggests that the MLP layers can capture the same information as the attention layers in the bottom blocks.
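As a concrete illustration, the sketch below estimates per-block entropies of the attention and MLP outputs with a simple histogram estimator. The module names (`patch_embed`, `blocks`, `attn`, `mlp`, `norm1`, `norm2`) assume a timm-style ViT, and the histogram estimator is an assumption for illustration, not necessarily the paper's exact measurement procedure.

```python
# Minimal sketch: per-block entropy of attention vs. MLP outputs.
# Assumes a timm-style ViT; the histogram estimator is illustrative only.
import torch

def histogram_entropy(x: torch.Tensor, bins: int = 256) -> float:
    """Shannon entropy (in nats) of a flattened activation tensor via histogram binning."""
    x = x.detach().flatten().float()
    probs = torch.histc(x, bins=bins) / x.numel()
    probs = probs[probs > 0]
    return float(-(probs * probs.log()).sum())

@torch.no_grad()
def layer_entropies(vit, images):
    """Return (attention entropy, MLP entropy) for each transformer block."""
    stats = []
    x = vit.patch_embed(images)  # assumed patch-embedding entry point
    for blk in vit.blocks:
        attn_out = blk.attn(blk.norm1(x))
        x = x + attn_out
        mlp_out = blk.mlp(blk.norm2(x))
        x = x + mlp_out
        stats.append((histogram_entropy(attn_out), histogram_entropy(mlp_out)))
    return stats
```

Comparing the two entries per block makes the reported trend easy to inspect: early blocks whose attention entropy is low and close to the MLP entropy are candidates for removal.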
The authors propose the Entropy-based Selection Strategy (NOSE) to identify the combination of attention layers that has the least interaction with the final output layer, measured by transfer entropy. This allows removing the attention layers on which the overall network depends least.
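The sketch below shows one way such a selection loop could look. The greedy search and the `transfer_entropy` scoring callback are illustrative assumptions; the summary does not specify the paper's exact estimator or search procedure.

```python
# Minimal sketch of an entropy-guided layer-selection loop in the spirit of NOSE:
# greedily grow the set of attention layers whose joint interaction with the
# final output layer (an abstract `transfer_entropy` score supplied by the
# caller) stays smallest. The scoring function is a placeholder.
from typing import Callable, FrozenSet

def select_attention_layers(
    num_layers: int,
    budget: int,
    transfer_entropy: Callable[[FrozenSet[int]], float],
) -> FrozenSet[int]:
    """Greedily pick `budget` attention layers minimizing joint transfer entropy."""
    chosen: FrozenSet[int] = frozenset()
    for _ in range(budget):
        remaining = [i for i in range(num_layers) if i not in chosen]
        # Add the layer whose inclusion keeps the interaction score lowest.
        best = min(remaining, key=lambda i: transfer_entropy(chosen | {i}))
        chosen = chosen | {best}
    return chosen
```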
A simple dilution learning technique is introduced to gradually degenerate the selected attention layers into an identity mapping, after which they can be absorbed into the subsequent MLP layers. In some cases this yields transformer blocks consisting of MLP layers only.
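A minimal sketch of this idea, under the assumption that dilution amounts to scaling the attention residual branch by a coefficient annealed from 1 to 0 during fine-tuning (the paper's exact formulation and schedule may differ):

```python
# Sketch: "dilute" an attention branch by scaling its residual contribution
# with a coefficient alpha annealed from 1 to 0, so the attention sub-layer
# degenerates into an identity mapping and the block reduces to its MLP.
import torch.nn as nn

class DilutedBlock(nn.Module):
    def __init__(self, block):
        super().__init__()
        self.block = block   # a ViT block exposing .attn/.norm1/.mlp/.norm2
        self.alpha = 1.0     # dilution coefficient, annealed toward 0

    def set_alpha(self, step: int, total_steps: int):
        # Linear annealing schedule (an assumption for illustration).
        self.alpha = max(0.0, 1.0 - step / total_steps)

    def forward(self, x):
        # Attention contribution fades out as alpha -> 0 (identity mapping).
        x = x + self.alpha * self.block.attn(self.block.norm1(x))
        x = x + self.block.mlp(self.block.norm2(x))
        return x
```

Once alpha reaches 0, the attention sub-layer contributes nothing and can be dropped, leaving an MLP-only block.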
Experiments on ImageNet-1k, CIFAR-100, and ADE20k demonstrate that the proposed method can effectively remove up to 50% of the attention layers without compromising performance. It also significantly improves throughput and memory efficiency compared to previous token pruning methods.