The paper explores an overlooked approach to continual learning that does not rely on techniques developed specifically for continual learning. Instead, it focuses on carefully tuning optimizers with adaptive learning rates, such as Adagrad, in combination with sparse activation functions.
The key findings are:
Sparse activation functions, particularly the novel Hard Adaptive SwisH (Hard ASH) activation, outperform non-sparse activations like ReLU on the Split-MNIST continual learning task (see the sketch after these findings).
Adagrad, an adaptive learning rate optimizer, performs best among the tested optimizers, followed by RMSprop and bias-correction-free ("biased") Adam; removing the bias correction from Adam significantly boosts its performance (a sketch of this variant follows the summary).
The sparsity introduced by the activation functions improves stability (retaining performance on old tasks) without significantly impacting plasticity (the ability to learn new tasks).
The approach outperforms established continual learning methods like EWC, SDMLP, and FlyModel on the Split-MNIST task while being conceptually simpler.
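For concreteness, below is a minimal, illustrative sketch of a hard-gated sparse activation in the spirit of Hard ASH, written in PyTorch. The summary does not give Hard ASH's exact parameterization, so the learnable per-unit threshold and hard-sigmoid gate here are assumptions for illustration, not the paper's definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HardGatedSparseActivation(nn.Module):
    """Illustrative hard-gated sparse activation (not the paper's exact Hard ASH).

    Each unit's pre-activation is multiplied by a piecewise-linear hard-sigmoid
    gate centred on a learnable threshold, so units well below their threshold
    output exactly zero and the layer's representation becomes sparse.
    """

    def __init__(self, num_features: int, init_threshold: float = 1.0):
        super().__init__()
        # Learnable per-unit threshold (assumption: the paper may share or fix it).
        self.threshold = nn.Parameter(torch.full((num_features,), init_threshold))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # F.hardsigmoid(z) = clamp(z / 6 + 0.5, 0, 1); the gate is exactly 0 for z <= -3.
        gate = F.hardsigmoid(x - self.threshold)
        return x * gate


# Hypothetical two-layer MLP for Split-MNIST-style experiments.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 512),
    HardGatedSparseActivation(512),
    nn.Linear(512, 10),
)
```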
The paper highlights how sparse representations and adaptive optimizers together yield a surprisingly capable continual learner without specialized continual learning algorithms.
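The optimizer finding is straightforward to try: Adagrad and RMSprop are available directly as torch.optim.Adagrad and torch.optim.RMSprop, but torch.optim.Adam has no switch for disabling bias correction, so the bias-correction-free variant is sketched below as a manual update on a single tensor. Hyperparameter values and the toy usage are placeholders, not the paper's settings.

```python
import torch


def adam_step(param, grad, state, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, bias_correction=True):
    """One Adam update on a single tensor.

    With bias_correction=False this is the 'biased Adam' variant mentioned
    above: the raw first/second moment estimates are used directly instead of
    being rescaled by 1 / (1 - beta ** t).
    """
    m, v = state
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    if bias_correction:
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
    else:
        m_hat, v_hat = m, v
    param = param - lr * m_hat / (v_hat.sqrt() + eps)
    return param, (m, v)


# Toy usage on a single parameter tensor (illustrative only).
w = torch.zeros(3)
state = (torch.zeros_like(w), torch.zeros_like(w))
for t in range(1, 11):
    grad = 2.0 * (w - torch.tensor([1.0, -2.0, 0.5]))  # gradient of a quadratic loss
    w, state = adam_step(w, grad, state, t, bias_correction=False)
```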
Key insights distilled from the paper by Santtu Keski... at arxiv.org (2024-04-30): https://arxiv.org/pdf/2404.17651.pdf