Core Concepts
A coarse-to-fine framework, CoFiTune, is proposed to strike a balance between the speciality and versatility of large language models by selectively fine-tuning specific modules within a defined layer range.
Abstract
The content discusses the challenge of balancing speciality and versatility in large language models (LLMs). Aligned LLMs exhibit remarkable versatility, but they often fall short in certain tasks or domains, requiring fine-tuning to gain speciality. However, fine-tuning can lead to catastrophic forgetting (CF), causing a significant loss of versatility.
To address this challenge, the authors propose a Coarse-to-Fine framework, CoFiTune, which consists of two key components:
- Coarse-grained Level:
  - An empirical tree-search algorithm identifies and updates specific modules (e.g., the feed-forward network, FFN) within a defined layer range that are crucial for gaining speciality without significantly affecting versatility (see the first sketch after this list).
  - The remaining parameters are kept frozen to further preserve versatility.
- Fine-grained Level:
  - A fine-grained soft-masking mechanism regulates the backward gradient flow according to how important each unit (attention head or neuron) is for versatility, further mitigating the CF issue without harming speciality (see the second sketch after this list).
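A minimal sketch of the coarse-grained step, assuming a LLaMA-style Hugging Face checkpoint whose decoder layers live under `model.model.layers` and whose FFN sub-module is named `mlp`; the model path is a placeholder. Only the FFN modules whose (1-based) layer index falls in (N × 25%, N × 50%] are left trainable.

```python
# Coarse-grained sketch: freeze everything, then unfreeze only the FFN (mlp)
# modules inside the (N * 25%, N * 50%] layer range.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("path/to/base-llm")  # placeholder path

num_layers = model.config.num_hidden_layers               # N
lo, hi = int(num_layers * 0.25), int(num_layers * 0.50)   # (N*25%, N*50%]

# Freeze all parameters to preserve versatility.
for param in model.parameters():
    param.requires_grad = False

# Unfreeze only the FFN sub-modules in the selected layer range.
for idx, layer in enumerate(model.model.layers):
    if lo < idx + 1 <= hi:                     # 1-based layer index inside (lo, hi]
        for param in layer.mlp.parameters():
            param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```

Training then proceeds with a standard SFT loop over the speciality data, so only the unfrozen FFN parameters receive optimizer updates.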
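Continuing the sketch above, a hedged illustration of the fine-grained soft-masking idea: per-neuron backward-gradient scaling registered as hooks on a trainable FFN projection. The uniform importance scores and the choice of `up_proj` as the masked projection are placeholder assumptions for illustration; the paper derives the scores from each unit's importance for versatility.

```python
# Fine-grained sketch: scale the gradient of each FFN neuron by (1 - importance),
# so units important for versatility receive smaller updates.
import torch

def attach_soft_mask(linear: torch.nn.Linear, importance: torch.Tensor) -> None:
    """Rescale per-output-neuron gradients of `linear` by (1 - importance)."""
    scale = (1.0 - importance).clamp(min=0.0, max=1.0)   # shape: (out_features,)

    # Hooks fire during backward and return the rescaled gradient.
    linear.weight.register_hook(lambda grad: grad * scale.unsqueeze(1))
    if linear.bias is not None:
        linear.bias.register_hook(lambda grad: grad * scale)

# Attach the mask to the FFN projections left trainable in the previous sketch.
for idx, layer in enumerate(model.model.layers):
    if lo < idx + 1 <= hi:
        proj = layer.mlp.up_proj   # illustrative choice of FFN projection
        # Placeholder scores in [0, 1] (1 = fully protected, gradient zeroed);
        # replace with estimated per-neuron importance for versatility.
        scores = torch.full(
            (proj.out_features,), 0.5,
            device=proj.weight.device, dtype=proj.weight.dtype,
        )
        attach_soft_mask(proj, scores)
```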
The authors conduct extensive experiments across diverse tasks and model scales, including a newly introduced Chinese CF setting. The results show that CoFiTune consistently outperforms baseline methods, on average retaining over 95% of the original model's versatility while reaching over 90% of the speciality of the full supervised fine-tuning (SFT) model.
The authors also provide several key insights:
- The "(N × 25%, N × 50%] - FFN" configuration, i.e., fine-tuning only the FFN modules in the second quarter of the model's N layers, yields the best overall performance across all tasks and model scales.
- The fine-grained soft-masking mechanism effectively mitigates the CF in versatility without harming speciality.
- The FFN module, especially the down-projection, is more crucial than the multi-head attention (MHA) module for gaining speciality.
- The versatility of LLMs may predominantly reside in the lower layer range (0, N × 25%], particularly within the FFN module.
These insights contribute to a better understanding of the information forwarding process in LLMs and provide valuable guidance for future research in this field.
Stats
"The presence of redundant parameters in Transformer-based models is crucial to identify the key components and accurately comprehend their internal mechanisms."
"Recent endeavors try to analyze the layers and modules in Transformer, revealing an information-copying behavior within the attention module and considering the up and down projection matrices in the FFN as the key and value of the memories."
Quotes
"Aligned LLMs showcase remarkable versatility, capable of handling diverse real-world tasks. Meanwhile, aligned LLMs are also expected to exhibit speciality, excelling in specific applications."
"Fine-tuning with extra data, a common practice to gain speciality, often leads to catastrophic forgetting (CF) of previously acquired versatility, hindering the model's performance across diverse tasks."
"CoFiTune consistently outperforms all baseline methods across diverse tasks and model scales. When compared to the full-parameter SFT, CoFiTune offers an average versatility improvement of 14%, while only incurring a marginal loss in speciality."