Optimizing Learning Rate Distributions to Mitigate Catastrophic Forgetting in Transformer-based Language Models
Carefully tuning the distribution of learning rates across the layers of a transformer-based language model can effectively mitigate catastrophic forgetting during sequential fine-tuning on multiple datasets.
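
To make the idea concrete, the sketch below shows one common way to assign a distinct learning rate to each transformer layer, using PyTorch optimizer parameter groups over a HuggingFace GPT-2 model. The model name ("gpt2"), the base learning rate, and the geometric per-layer decay factor are illustrative assumptions, not the optimized distribution this work describes; the point is only how a layer-wise learning rate schedule can be wired into an optimizer.

```python
# Minimal sketch (assumed setup, not the paper's exact method): per-layer
# learning rates via PyTorch optimizer parameter groups on GPT-2.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

base_lr = 5e-5   # hypothetical learning rate for the topmost block
decay = 0.9      # hypothetical per-layer decay toward the embeddings

# GPT-2 exposes its transformer blocks as model.transformer.h
blocks = list(model.transformer.h)
param_groups = []
for i, block in enumerate(blocks):
    # Lower (earlier) blocks receive smaller learning rates so that general
    # knowledge acquired during pre-training is perturbed less.
    lr = base_lr * (decay ** (len(blocks) - 1 - i))
    param_groups.append({"params": block.parameters(), "lr": lr})

# Token and position embeddings get the smallest rate; the final layer norm
# (and the tied LM head) use the base rate.
emb_lr = base_lr * (decay ** len(blocks))
param_groups.append({"params": model.transformer.wte.parameters(), "lr": emb_lr})
param_groups.append({"params": model.transformer.wpe.parameters(), "lr": emb_lr})
param_groups.append({"params": model.transformer.ln_f.parameters(), "lr": base_lr})

optimizer = torch.optim.AdamW(param_groups, weight_decay=0.01)
```

In this sketch the decay factor controls how strongly early layers are frozen relative to later ones; finding a distribution of per-layer rates that preserves earlier tasks while still adapting to new data is the optimization problem at issue here.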