Key Concepts
Cross-Architecture Transfer Learning (XATL) can significantly reduce the training time and improve the performance of Linear-Cost Inference (LCI) Transformer models by directly transferring compatible weights from pre-trained Transformer models, without the need to train the LCI models from scratch.
Summary
The paper proposes a weight transfer learning paradigm called Cross-Architecture Transfer Learning (XATL) to address the costly pre-training requirement of Linear-Cost Inference (LCI) Transformer models. LCI models, such as RetNet and Mamba, aim to improve the efficiency of Transformer language models by replacing the self-attention block with a design that has linear-cost inference. However, this architectural change typically requires pre-training the weights from scratch, which incurs a significant cost.
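To make the cost difference concrete, the sketch below shows a generic linear-cost token mixer in the style of linear attention/retention. It is an illustrative simplification, not the exact RetNet or Mamba formulation: each decoding step updates a fixed-size state instead of re-attending over a growing key-value cache, so the per-token cost is independent of sequence length.

```python
import torch

def linear_recurrent_step(state, q_t, k_t, v_t, decay=0.99):
    """One decoding step of a generic linear-cost token mixer (illustrative only).

    The running state S_t = decay * S_{t-1} + k_t v_t^T has a fixed (d, d) size,
    so each new token costs O(d^2) regardless of sequence length, whereas
    softmax attention revisits the entire KV cache at every step.
    """
    state = decay * state + torch.outer(k_t, v_t)  # fixed-size state update
    y_t = q_t @ state                              # output for the current token
    return y_t, state

# Example: decode 4 tokens with a (single-head) hidden size of 8.
d = 8
state = torch.zeros(d, d)
for _ in range(4):
    q_t, k_t, v_t = torch.randn(d), torch.randn(d), torch.randn(d)
    y_t, state = linear_recurrent_step(state, q_t, k_t, v_t)
```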
The key idea of XATL is to directly transfer the weights of the components shared between LCI and self-attention-based Transformers, such as layer norms, MLPs, and input/output embeddings, from a pre-trained Transformer model to the new LCI architecture. This gives the LCI models a strong initialization and accelerated training without requiring full pre-training.
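The transfer itself can be pictured as a name-and-shape match between the two models' state dicts. The following PyTorch-style sketch is not the authors' released code; it assumes the pre-trained Transformer and the LCI model (hypothetical `pretrained` and `lci_model` modules) use matching parameter names for the shared components, so that only the attention-replacement blocks keep their random initialization.

```python
import torch.nn as nn

def xatl_transfer(pretrained: nn.Module, lci_model: nn.Module) -> list[str]:
    """Copy weights of components shared by both architectures (input/output
    embeddings, layer norms, MLPs) from the pre-trained Transformer into the
    LCI model; parameters unique to the LCI blocks keep their random init."""
    src = pretrained.state_dict()
    dst = lci_model.state_dict()
    transferred = []
    for name, tensor in dst.items():
        # A parameter counts as "compatible" here if it exists in both models
        # with the same shape; the attention replacements differ and are skipped.
        if name in src and src[name].shape == tensor.shape:
            dst[name] = src[name].clone()
            transferred.append(name)
    lci_model.load_state_dict(dst)
    return transferred
```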
The authors conducted extensive experiments on varying model sizes and alternative attention architectures, including RetNet and Striped-Mamba. The results show that XATL can significantly reduce the training time by up to 2.5x and converge to a better minimum with up to 2.6% stronger performance on language modeling and commonsense benchmarks, compared to training the LCI models from scratch, within the same compute budget.
The paper also explores the effects of freezing and unfreezing the transferred weights during training, as well as the benefits of using a hybrid architecture that combines attention and LCI components. The findings demonstrate the effectiveness of XATL in enabling the wider adoption of efficient LCI models by alleviating the costly pre-training requirement.
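As a rough illustration of the freeze/unfreeze ablation, one might simply toggle gradient updates on the transferred parameters and release them later in training; the paper's exact schedule is not reproduced here, and `transferred` refers to the name list returned by the sketch above.

```python
import torch.nn as nn

def set_transferred_trainable(lci_model: nn.Module, transferred: list[str],
                              trainable: bool) -> None:
    """Freeze (trainable=False) or unfreeze (trainable=True) the transferred
    weights, so training can start by updating only the new LCI blocks and
    later fine-tune the whole model jointly."""
    for name, param in lci_model.named_parameters():
        if name in transferred:
            param.requires_grad = trainable
```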
Statistics
The authors report the following key figures:
XATL can reduce the training time by up to 2.5x to reach the same performance as training from scratch.
XATL can improve the performance by up to 2.6% on language modeling and commonsense benchmarks, compared to training from scratch, within the same compute budget.
Quotes
"Cross-Architecture Transfer Learning (XATL) can significantly reduce the training time and improve the performance of Linear-Cost Inference (LCI) Transformer models by directly transferring compatible weights from pre-trained Transformer models, without the need to train the LCI models from scratch."
"XATL can reduce the training time by up to 2.5x to reach the same performance as training from scratch."
"XATL can improve the performance by up to 2.6% on language modeling and commonsense benchmarks, compared to training from scratch, within the same compute budget."