The paper presents a toolkit for creating low-cost Mixture-of-Domain-Experts (MOE) language models from a trained source model and pre-trained domain-expert models. The key insights are:
Mixing a source model with pre-trained, domain-specialized expert models is an effective way to augment the capabilities of the source model without extensive fine-tuning.
The toolkit offers flexibility in how the MOE is constructed, including a Gate-less MOE that assigns equal weight to every expert and a Noisy MOE that uses a simple linear layer to select the top-K experts for each token (see the first sketch below).
Router training can provide some benefit, particularly on math-focused tasks, but is not always necessary to achieve competitive performance.
The MOE approach can outperform the source model and individual expert models, with the optimal configuration depending on the specific use case and available expert models.
The toolkit supports mixing both full FFN layers and LoRA adapters as experts, and provides options to train the routers, the embeddings, or a combination of the two (see the second sketch below).
Overall, the low-cost MOE creation approach enables rapid customization of language models to specific needs by leveraging pre-trained domain experts.
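The following is a minimal PyTorch sketch of the two routing modes named above, not the toolkit's actual code; the class name `DomainExpertMOE`, the `mode` flag, the noise scale, and the `top_k` default are illustrative assumptions.

```python
# Sketch of Gate-less vs. Noisy MOE routing over domain-expert FFN blocks.
# Names and defaults are assumptions for illustration, not the toolkit's API.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DomainExpertMOE(nn.Module):
    """Routes each token through pre-trained expert FFNs (the source model's
    own FFN can be included as one of the experts)."""

    def __init__(self, hidden_dim, expert_ffns, mode="noisy", top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(expert_ffns)  # pre-trained FFN blocks
        self.mode = mode
        self.top_k = top_k
        # Simple linear router used only in "noisy" mode; per the summary,
        # it can be left untrained or lightly trained.
        self.router = nn.Linear(hidden_dim, len(expert_ffns), bias=False)

    def forward(self, x):  # x: (batch, seq, hidden_dim)
        if self.mode == "gateless":
            # Gate-less MOE: every expert contributes with equal weight.
            outs = torch.stack([expert(x) for expert in self.experts], dim=0)
            return outs.mean(dim=0)

        # Noisy MOE: a linear layer scores experts per token; small Gaussian
        # noise perturbs the scores, then only the top-K experts are weighted.
        logits = self.router(x) + 1e-2 * torch.randn_like(self.router(x))
        weights, idx = torch.topk(logits, self.top_k, dim=-1)   # (B, S, K)
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(x)
        for e_id, expert in enumerate(self.experts):
            # Per-token gate for this expert: its routing weight if selected
            # in any of the K slots, zero otherwise. (A real implementation
            # would dispatch only the selected tokens to each expert.)
            gate = ((idx == e_id).to(x.dtype) * weights).sum(dim=-1, keepdim=True)
            out = out + gate * expert(x)
        return out
```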
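This second sketch illustrates, under stated assumptions, how a LoRA adapter could act as an expert branch and how training could be restricted to the router parameters; `LoRAExpert`, `train_router_only`, and the rank/learning-rate values are hypothetical and not taken from the toolkit.

```python
# Sketch of (a) a LoRA adapter wrapped to behave like a domain-expert FFN and
# (b) freezing everything except router parameters. Names are illustrative.
import torch
import torch.nn as nn


class LoRAExpert(nn.Module):
    """Frozen base FFN plus a low-rank (LoRA) update, so the pair can be
    plugged in wherever a full expert FFN is expected."""

    def __init__(self, base_ffn, hidden_dim, rank=8, alpha=16.0):
        super().__init__()
        self.base_ffn = base_ffn
        for p in self.base_ffn.parameters():
            p.requires_grad_(False)            # base weights stay frozen
        self.lora_a = nn.Linear(hidden_dim, rank, bias=False)
        self.lora_b = nn.Linear(rank, hidden_dim, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base_ffn(x) + self.scale * self.lora_b(self.lora_a(x))


def train_router_only(moe_model, lr=1e-4):
    """Freeze all parameters except those whose name contains 'router',
    mirroring the 'train only the routers' option described above."""
    for name, param in moe_model.named_parameters():
        param.requires_grad_("router" in name)
    trainable = [p for p in moe_model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```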