Core Concepts
Stratified Progressive Adaptation Fine-tuning (SPAFIT) is a parameter-efficient fine-tuning (PEFT) method that outperforms other PEFT methods while fine-tuning only a small fraction of a model's parameters.
Abstract
The paper proposes a novel fine-tuning method called Stratified Progressive Adaptation Fine-tuning (SPAFIT) for pre-trained large language models. The key idea behind SPAFIT is to stratify the model's Transformer layers into three distinct groups and apply progressively more expressive fine-tuning methods in the deeper layers.
Group 1 layers remain frozen, on the hypothesis that the initial layers capture basic linguistic knowledge shared across tasks. In Group 2, only the bias terms are updated, following the BitFit method. In Group 3, the attention sub-layer weights are adapted with LoRA, while the intermediate and output (feed-forward) sub-layers use BitFit.
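The stratification above can be sketched as a small planning helper. This is a hypothetical illustration, not the authors' code, and it assumes the two numbers in a configuration name like "SPAFIT-4-9" mark the last layer of Group 1 and of Group 2, respectively:

```python
def spafit_plan(num_layers, group1_end, group2_end):
    """Map each 1-indexed Transformer layer to its SPAFIT tuning method.

    Assumed convention: layers 1..group1_end are frozen (Group 1),
    layers group1_end+1..group2_end use BitFit (Group 2), and the
    remaining layers use LoRA on attention plus BitFit on the
    feed-forward sub-layers (Group 3).
    """
    plan = {}
    for layer in range(1, num_layers + 1):
        if layer <= group1_end:
            plan[layer] = "frozen"         # Group 1: no parameters updated
        elif layer <= group2_end:
            plan[layer] = "bitfit"         # Group 2: bias terms only
        else:
            plan[layer] = "lora+bitfit"    # Group 3: LoRA (attention) + BitFit (FFN)
    return plan

# SPAFIT-4-9 on the 24 encoder layers of BERT-large-cased:
plan = spafit_plan(24, 4, 9)
```

In an actual fine-tuning loop, this plan would drive which parameter groups have `requires_grad` enabled and where LoRA adapter matrices are injected.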
The authors evaluate SPAFIT on the GLUE benchmark and show that it outperforms other parameter-efficient fine-tuning (PEFT) methods such as LoRA and BitFit while fine-tuning significantly fewer parameters. Specifically, the SPAFIT-4-9-I and SPAFIT-4-9-II configurations achieve the best performance, fine-tuning only 5.65 million and 7.49 million parameters, respectively, of the 333.58 million total in the BERT-large-cased model.
The authors also discuss SPAFIT's limitations, including untested performance on tasks more complex than classification, the number of hyperparameters involved, and possible catastrophic-forgetting effects. Future work includes evaluating SPAFIT on tasks like summarization and extending the method to models with both encoder and decoder stacks.
Stats
The BERT-large-cased model has a total of 333.58 million parameters.
The SPAFIT-4-9-I model fine-tunes 5.65 million parameters, while the SPAFIT-4-9-II model fine-tunes 7.49 million parameters.
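These figures imply that only a very small share of the model is updated; a quick check of the fractions (using the parameter counts quoted above, in millions):

```python
# Fraction of BERT-large-cased parameters fine-tuned by each
# SPAFIT configuration, using the figures from the summary above.
total_params_m = 333.58  # total parameters, in millions

for name, tuned_m in [("SPAFIT-4-9-I", 5.65), ("SPAFIT-4-9-II", 7.49)]:
    fraction = tuned_m / total_params_m
    print(f"{name}: {fraction:.2%} of parameters fine-tuned")
# SPAFIT-4-9-I:  ~1.69% of parameters
# SPAFIT-4-9-II: ~2.25% of parameters
```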