Core Concepts
The authors examine how the FFN-Wider Transformer architecture affects base capabilities, focusing on the contribution ratio of the combination function (the multi-head attention layer). The proposed Combination Enhanced Architecture (CEA) aims to reverse the resulting decline in base capabilities.
Abstract
The study examines how architecture influences the base capabilities of pre-trained language models, using FFN-Wider Transformers as a case study. By analyzing the contribution ratio of the combination function, the research motivates CEA as a way to restore base capabilities. Experimental results show that models improve when CEA is applied.
The paper considers the effect of architecture on abilities beyond in-distribution language modeling, namely out-of-distribution language modeling and transfer learning. It argues that understanding the role of architecture is necessary for improving base capabilities, and it proposes CEA as a practical remedy.
Key findings show that changing the width ratio of the FFN layers significantly affects model performance: widening the FFN lowers the contribution ratio of the combination function (the MHA layer), and this drop is accompanied by a decline in base capabilities. The study extends its analysis to MoE Transformers, where architectural enhancements also show potential improvements.
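To make the width-ratio idea concrete, the following minimal sketch counts weight-matrix parameters in a single Transformer layer and reports the share held by the MHA layer. It is only a rough proxy: the paper's contribution ratio is a measured quantity, not a parameter share, and the function name, the hidden size of 768, and the widened ratio of 32 are assumptions chosen for illustration rather than settings taken from the study.

```python
def layer_param_counts(d_model: int, ffn_ratio: int) -> dict:
    """Weight-matrix parameter counts for one Transformer layer.

    Biases and LayerNorm parameters are ignored for simplicity.
    `ffn_ratio` is the FFN inner width divided by d_model (4 in vanilla BERT).
    """
    mha = 4 * d_model * d_model                # W_Q, W_K, W_V, W_O
    ffn = 2 * d_model * (ffn_ratio * d_model)  # up- and down-projection
    return {"mha": mha, "ffn": ffn, "mha_share": mha / (mha + ffn)}


if __name__ == "__main__":
    # 4 is the standard ratio; 32 is a hypothetical "FFN-Wider" setting.
    for ratio in (4, 32):
        counts = layer_param_counts(768, ratio)
        print(f"ffn_ratio={ratio:>2}  mha_share={counts['mha_share']:.3f}")
```

With the standard ratio of 4 the MHA layer holds one third of the layer's weight parameters; at the hypothetical ratio of 32 its share falls to 1/17. This only illustrates how widening the FFN shifts the balance away from the combination function; it does not reproduce the paper's measurement.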
Overall, the research shows how architectural choices can be tuned to improve the base capabilities of pre-trained language models, and it offers guidance for future architecture design.
Stats
"Under similar pre-training performance, the FFN-Wider BERT models demonstrate a noticeable decline in base capabilities compared to vanilla BERT models."
"As shown in Figure 1, under similar pre-training performance, the FFN-Wider BERT models exhibit a noticeable decline in both out-of-distribution language modeling and downstream tasks fine-tuning compared to vanilla BERT models."
Quotes
"The actual contribution ratio of the MHA layer is a key factor affecting model’s base capabilities."
"Controlling the width ratio indeed directly influences the contribution ratio of the combination function."