This paper examines the components of the BERT model to determine which undergoes the most significant change during fine-tuning on the natural language processing (NLP) tasks of the GLUE benchmark. The analysis reveals that the output layer normalization (LayerNorm) component changes substantially more than any other component.
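One way to picture this kind of analysis is to compare the pretrained weights against a fine-tuned checkpoint and aggregate the change per component. The sketch below is illustrative, not the authors' code: the fine-tuned checkpoint path is a placeholder, and the component grouping is a simplified mapping of BERT parameter names.

```python
# Minimal sketch: measure the average absolute weight change per BERT component
# between a pretrained and a fine-tuned checkpoint (checkpoint path is a placeholder).
from collections import defaultdict
import torch
from transformers import AutoModel

pretrained = AutoModel.from_pretrained("bert-base-uncased")
finetuned = AutoModel.from_pretrained("path/to/finetuned-bert")  # hypothetical checkpoint

def component_of(name: str) -> str:
    """Map a BERT parameter name to a coarse component label."""
    if "LayerNorm" in name:
        if "attention" in name:
            return "attention LayerNorm"
        if "output" in name:
            return "output LayerNorm"
        return "embedding LayerNorm"
    for key in ("query", "key", "value", "attention.output.dense",
                "intermediate.dense", "output.dense", "embeddings"):
        if key in name:
            return key
    return "other"

changes = defaultdict(list)
for (name, p0), (_, p1) in zip(pretrained.named_parameters(), finetuned.named_parameters()):
    # Mean absolute change of this parameter tensor, grouped by component.
    changes[component_of(name)].append((p1.detach() - p0.detach()).abs().mean().item())

for comp, vals in sorted(changes.items(), key=lambda kv: -sum(kv[1]) / len(kv[1])):
    print(f"{comp:25s} mean |delta w| = {sum(vals) / len(vals):.2e}")
```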
The authors then demonstrate that fine-tuning only the LayerNorm component achieves performance comparable to, and in some cases better than, full fine-tuning and other parameter-efficient fine-tuning methods such as BitFit. Moreover, they use Fisher information to identify the most critical subset of LayerNorm parameters and show that many GLUE tasks can be solved by fine-tuning only a small portion of LayerNorm with negligible performance degradation.
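A minimal sketch of these two ingredients, assuming a HuggingFace sequence-classification setup rather than the authors' released code: freeze everything outside LayerNorm (plus the task head), then score the trainable LayerNorm entries with an empirical diagonal Fisher estimate, i.e. squared gradients accumulated over training batches, and keep only a top fraction. The model name, the toy batch, and the 10% keep ratio are illustrative choices, not values from the paper.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# (a) LayerNorm-only fine-tuning: freeze every parameter outside LayerNorm;
#     the task-specific classifier head stays trainable as well.
for name, param in model.named_parameters():
    param.requires_grad = ("LayerNorm" in name) or ("classifier" in name)

# (b) Empirical diagonal Fisher information for the LayerNorm entries:
#     accumulate squared gradients of the loss (here over a single toy batch).
batch = tokenizer(["a toy example", "another toy example"], return_tensors="pt", padding=True)
batch["labels"] = torch.tensor([0, 1])

fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters() if "LayerNorm" in n}
model.train()
model.zero_grad()
model(**batch).loss.backward()
for n, p in model.named_parameters():
    if n in fisher and p.grad is not None:
        fisher[n] += p.grad.detach() ** 2

# Keep only the highest-scoring fraction of LayerNorm entries (10% here, purely
# illustrative); the 0/1 masks mark which entries remain trainable.
all_scores = torch.cat([f.flatten() for f in fisher.values()])
threshold = torch.quantile(all_scores, 0.90)
masks = {n: (f >= threshold).float() for n, f in fisher.items()}
```

During actual training one would multiply each LayerNorm gradient by its mask before the optimizer step, so only the selected entries get updated.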
The paper also presents a cross-validation experiment showing that the selected subset of LayerNorm parameters generalizes: it can be reused on new tasks without a task-specific parameter-selection step. These findings suggest that LayerNorm is a crucial component for parameter-efficient fine-tuning of large language models, and that only a small portion of its parameters needs to be updated to achieve strong performance on a wide range of NLP tasks.
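As an illustration of that reuse, a fixed mask computed on other tasks could be applied to a new task by zeroing gradients outside it, so no per-task selection is needed. The helper below is hypothetical; `masks` refers to the assumed output of the previous sketch.

```python
def apply_fixed_layernorm_mask(model, masks):
    """Restrict updates to the LayerNorm entries selected in `masks`.

    `masks` maps parameter names to 0/1 tensors (e.g. computed on other GLUE
    tasks); gradients outside the mask are zeroed by a backward hook, and all
    other parameters except the new task head are frozen.
    """
    for name, param in model.named_parameters():
        if name in masks:
            mask = masks[name]
            param.register_hook(lambda grad, m=mask: grad * m)
        elif "classifier" not in name:  # keep the new task head trainable
            param.requires_grad = False
```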
Key insights distilled from the paper by Taha Valizad... (arxiv.org, 04-01-2024): https://arxiv.org/pdf/2403.20284.pdf