The paper introduces a methodology, based on Optimal Transport (OT), for efficiently fusing pre-trained transformer-based models, such as Vision Transformers (ViTs) and BERT, to combine their capabilities. The key contributions are:
A novel graph-based interpretation of the flow of transportation maps through the network, which makes it possible to handle the idiosyncratic architectural components of transformers, such as multi-head self-attention, layer normalization, and residual connections (see the residual-block sketch after this list).
An analysis showing that soft alignment via the Sinkhorn algorithm outperforms hard alignment (via the Earth Mover's Distance, EMD) for transformers, contrary to previous findings for simpler architectures; a minimal Sinkhorn sketch follows this list.
Extensive experiments on image classification with ViTs and natural language modeling with BERT, demonstrating that the proposed fusion approach consistently outperforms vanilla (direct weight-averaging) fusion and, after a short finetuning, can even surpass the individual converged parent models.
The ability to fuse models of different sizes (heterogeneous fusion), providing an efficient alternative to knowledge distillation.
The authors showcase the potential of fusing multiple transformers to compound their expertise, offering a promising paradigm for model fusion and recombination.
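To make the soft-alignment idea concrete, here is a minimal sketch, not the authors' implementation: it uses entropy-regularized Sinkhorn iterations to compute a soft transport map between the neurons of two linear layers, then averages the aligned weights. The function names (`sinkhorn`, `fuse_linear`), the uniform neuron masses, and the incoming-weight-vector ground cost are all assumptions made for this example.

```python
import numpy as np

def sinkhorn(cost, reg=0.05, n_iters=200):
    """Entropy-regularized OT (Sinkhorn iterations): returns a soft
    transport map T with uniform marginals 1/n (rows) and 1/m (cols)."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / reg)                  # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):                 # alternate marginal scalings
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]       # T, shape (n, m)

def fuse_linear(W_a, W_b, T_in):
    """Fuse one linear layer of two parents. W_a: (out_a, in_a),
    W_b: (out_b, in_b); T_in, shape (in_a, in_b), aligns the inputs.
    out_a and out_b may differ, enabling heterogeneous fusion."""
    in_b = T_in.shape[1]
    # Express B's weights in A's input basis via the scaled transport map.
    W_b_in = W_b @ (in_b * T_in.T)           # -> (out_b, in_a)
    # Ground cost: distance between the neurons' incoming weight vectors.
    cost = np.linalg.norm(W_a[:, None, :] - W_b_in[None, :, :], axis=-1)
    cost /= cost.max()                       # normalize for stability
    T_out = sinkhorn(cost)                   # soft map, (out_a, out_b)
    # Barycentric projection of B's neurons onto A's, then average.
    W_b_aligned = (W_a.shape[0] * T_out) @ W_b_in
    return 0.5 * (W_a + W_b_aligned), T_out

# Toy usage: parents of different widths (8 vs. 6 output neurons).
rng = np.random.default_rng(0)
W_a, W_b = rng.normal(size=(8, 4)), rng.normal(size=(6, 4))
T_identity = np.eye(4) / 4                   # inputs already aligned
W_fused, T = fuse_linear(W_a, W_b, T_identity)
print(W_fused.shape)                         # (8, 4): A's geometry is kept
```

Swapping `sinkhorn` for a hard assignment on the same cost matrix (e.g., `scipy.optimize.linear_sum_assignment`) recovers the EMD-style hard alignment that the paper's analysis found inferior for transformers.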
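Residual connections are among the trickiest components in the second bullet, because the skip path and the branch must end up in the same neuron basis. The sketch below, which reuses `fuse_linear` from the previous block, shows one simple way to keep the transportation maps consistent through a two-layer residual block; forcing the outgoing map to equal the incoming one is a simplifying assumption for illustration, not the paper's exact graph-derived rule.

```python
def fuse_residual_block(Wa1, Wa2, Wb1, Wb2, T_in):
    """Fuse a block computing y = x + W2 @ relu(W1 @ x).
    The skip path carries x through unchanged, so the transport map
    leaving the block must equal the incoming map T_in."""
    W1_fused, T_mid = fuse_linear(Wa1, Wb1, T_in)   # hidden-layer map
    mid_b = T_mid.shape[1]
    # Align W2_b's inputs with the hidden-layer map T_mid ...
    Wb2_in = Wb2 @ (mid_b * T_mid.T)
    # ... but align its outputs with T_in instead of computing a fresh
    # map, so that branch and skip agree neuron-by-neuron.
    Wb2_aligned = (T_in.shape[0] * T_in) @ Wb2_in
    W2_fused = 0.5 * (Wa2 + Wb2_aligned)
    return W1_fused, W2_fused, T_in   # outgoing map == incoming map
```

Multi-head self-attention imposes a constraint of a similar flavor, since alignment has to respect head boundaries; tracking such constraints is precisely what the graph interpretation of the transportation-map flow is designed for.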
Source: Moritz Imfel..., arxiv.org, 04-23-2024, https://arxiv.org/pdf/2310.05719.pdf