Core Concepts
A novel continual learning approach that automatically expands pre-trained vision transformers by adding modular adapters and representation descriptors to accommodate distribution shifts in incoming tasks, without the need for memory rehearsal.
Abstract
The paper proposes a continual learning framework called SEMA (Self-Expansion of pre-trained Models with Modularized Adaptation) that can be integrated into transformer-based pre-trained models like Vision Transformer (ViT).
Key highlights:
- SEMA employs a modular adapter design, where each adapter module consists of a functional adapter and a representation descriptor. The representation descriptor captures the distribution of the relevant input features and serves as an indicator of distribution shift during training.
- SEMA automatically decides whether to reuse existing adapters or add new ones based on the distribution shift detected by the representation descriptors. This allows the model to efficiently accommodate changes in the incoming data distribution without overwriting previously learned knowledge.
- An expandable weighting router is learned jointly with the adapter modules to combine the outputs of the different adapters; a toy sketch of how the adapters, descriptors, and router fit together follows this list.
- SEMA operates without the need for memory rehearsal and outperforms state-of-the-art ViT-based continual learning methods on various benchmarks, including CIFAR-100, ImageNet-R, ImageNet-A, and VTAB.
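The highlights above can be made concrete with a short PyTorch-style sketch. This is not the authors' implementation: the use of a small autoencoder as the representation descriptor, the reconstruction-error threshold for detecting a shift, the residual placement of the adapter output, and all class and parameter names (AdapterModule, ExpandableAdapterLayer, shift_threshold) are illustrative assumptions that only mirror the roles described in the paper.

```python
# Illustrative sketch (not SEMA's actual code): a modular adapter with a
# representation descriptor, a shift-triggered expansion check, and a soft
# router that mixes adapter outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdapterModule(nn.Module):
    """One adapter unit: a low-rank functional adapter plus a representation
    descriptor (here a small autoencoder, an assumed stand-in) that models the
    distribution of the features it has been trained on."""

    def __init__(self, dim: int, bottleneck: int = 16):
        super().__init__()
        # Functional adapter: down-project, non-linearity, up-project.
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        # Representation descriptor: reconstruction error serves as a proxy
        # for "have features like this been seen before?"
        self.descriptor = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.ReLU(), nn.Linear(bottleneck, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(F.relu(self.down(x)))

    def novelty(self, x: torch.Tensor) -> torch.Tensor:
        # Mean reconstruction error over the batch; a large value signals a
        # distribution shift with respect to this adapter.
        return F.mse_loss(self.descriptor(x), x)


class ExpandableAdapterLayer(nn.Module):
    """Holds a growing list of adapters and an expandable router whose
    per-adapter logits weight the adapter outputs."""

    def __init__(self, dim: int, shift_threshold: float = 0.5):
        super().__init__()
        self.dim = dim
        self.shift_threshold = shift_threshold  # assumed hyperparameter
        self.adapters = nn.ModuleList([AdapterModule(dim)])
        self.router = nn.Linear(dim, 1)  # one logit per adapter

    @torch.no_grad()
    def maybe_expand(self, x: torch.Tensor) -> bool:
        """Add a new adapter only if every existing descriptor flags the
        incoming features as out-of-distribution; otherwise reuse."""
        if min(a.novelty(x).item() for a in self.adapters) <= self.shift_threshold:
            return False  # some existing adapter already covers this distribution
        self.adapters.append(AdapterModule(self.dim).to(x.device))
        # Grow the router by one output logit, keeping the old weights.
        old = self.router
        self.router = nn.Linear(self.dim, len(self.adapters)).to(x.device)
        self.router.weight.data[: old.out_features] = old.weight.data
        self.router.bias.data[: old.out_features] = old.bias.data
        return True

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.router(x), dim=-1)            # [B, T, K]
        outputs = torch.stack([a(x) for a in self.adapters], -1)   # [B, T, D, K]
        mixed = (outputs * weights.unsqueeze(-2)).sum(-1)          # [B, T, D]
        return x + mixed  # residual adapter output


if __name__ == "__main__":
    layer = ExpandableAdapterLayer(dim=768)
    feats = torch.randn(4, 197, 768)  # e.g., ViT patch-token features
    layer.maybe_expand(feats)         # checks descriptors, may add an adapter
    out = layer(feats)                # routed, mixed adapter output
```

Keeping one descriptor per adapter is what allows the expansion decision to be made locally: a new adapter is added only when no existing descriptor recognizes the incoming feature distribution, matching the reuse-or-expand behavior described in the highlights.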
The paper also provides extensive ablation studies and analyses to validate the effectiveness of the proposed self-expansion mechanism and the design choices.
Stats
"We demonstrate that the proposed framework outperforms the state-of-the-art without memory rehearsal."
"By comparing with vision-transformer-based continual learning adaptation methods, we demonstrate that the proposed framework outperforms the state-of-the-art without memory rehearsal."
Quotes
"To address the difficulty of maintaining all seen data and repeatedly re-training the model, continual learning (CL) aims to learn incrementally from a continuous data stream [15, 52, 62]."
"While most CL methods [6,42,70] root in the "training-from-scratch" paradigm, recent works have started to explore the potential of integrating pre-trained foundation models into CL as robust feature extractors [45, 79], or adapting them to downstream tasks through parameter-efficient fine-tuning with prompts and/or adapters [14, 60, 65, 66, 79, 80]."