Key Idea
Merging sparsely updated parameter deltas from fine-tuned Vision Transformers, guided by the principle of orthogonality, effectively combats catastrophic forgetting in continual learning tasks.
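The mechanism described above can be sketched numerically: randomly mask most entries of each task's parameter delta, then merge the sparse deltas by summation. This is a minimal NumPy illustration under stated assumptions (random Gaussian deltas standing in for real fine-tuned ViT deltas, merge-by-summation, and a 10% keep rate), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_delta(delta, keep_rate, rng):
    """Randomly keep a fraction of delta entries, zeroing out the rest."""
    mask = rng.random(delta.shape) < keep_rate
    return delta * mask

# Hypothetical deltas from T fine-tuned tasks (flattened weight vectors).
T, d = 10, 100_000
deltas = [rng.normal(size=d) for _ in range(T)]

keep_rate = 0.1  # i.e., mask 90% of each delta
sparse = [sparse_delta(dl, keep_rate, rng) for dl in deltas]

# Merge by simple summation (added on top of the frozen pre-trained weights).
merged = sum(sparse)

# Pairwise cosine similarity of two sparse deltas is near zero:
# random masking makes the deltas approximately orthogonal.
cos = sparse[0] @ sparse[1] / (
    np.linalg.norm(sparse[0]) * np.linalg.norm(sparse[1]) + 1e-12
)
```

Because each delta keeps a different random 10% of positions, any two sparse deltas overlap on only ~1% of entries, which is what drives the near-orthogonality the statistics below report.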
Statistics
Merging delta parameters with a 10% masking rate can lead to a parameter collision rate of 99.96%.
Increasing the masking rate during delta merging reduces parameter collisions and significantly improves model performance.
Randomly masking 90% of delta parameters in CIFAR100 across 10 tasks results in high orthogonality among the sparse deltas.
SoTU improves final accuracy by +3.7% on Cars196, +2.3% on ImageNet-A, and +1.6% on ImageNet-R compared to the current SOTA method (RanPAC).
Without nonlinear feature projection, SoTU significantly improves classification accuracy over RanPAC, particularly on ImageNet-R, demonstrating its effectiveness against catastrophic forgetting in feature space.
Randomly masking 40%∼60% of delta parameters retains similar attention maps, supporting the theoretical analysis.
Masking 80%∼90% of delta parameters maintains competitive performance, indicating that a small fraction of delta parameters suffices to store task-specific knowledge.
Larger models (e.g., ViT-L) tolerate higher delta sparsity compared to smaller models (e.g., ViT-S).
Merging high-sparsity deltas (p ≈ 0.7) achieves competitive performance with fully fine-tuned models across different datasets and ViT models.
Merging low-sparsity deltas severely hurts model performance due to parameter collisions.
A delta sparsity of 1−p ≈ 1/T, where T is the number of tasks, appears to be a promising strategy for balancing knowledge preservation and parameter collision avoidance.
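The 1−p ≈ 1/T heuristic can be motivated with a simple probabilistic sketch. Assuming each delta keeps a random fraction of positions independently (an assumption for illustration; the collision definition here, "two or more deltas nonzero at the same position," may differ from the paper's exact metric), the expected collision rate has a closed form:

```python
def collision_rate(keep_rate, num_tasks):
    """Expected fraction of positions where >= 2 independently
    masked deltas are simultaneously nonzero.

    P(collision) = 1 - P(no delta keeps it) - P(exactly one keeps it)
    """
    q = 1.0 - keep_rate
    p_none = q ** num_tasks
    p_exactly_one = num_tasks * keep_rate * q ** (num_tasks - 1)
    return 1.0 - p_none - p_exactly_one

# With 10 tasks, keeping 90% of each delta makes collisions near-certain,
# while keeping only 1/T = 10% keeps the collision rate far lower.
high = collision_rate(0.9, 10)
low = collision_rate(0.1, 10)
```

Under this model the collision rate falls monotonically as the keep rate shrinks, which is consistent with the statistic above that a keep rate near 1/T balances knowledge preservation against collisions.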
Quotes
"We found that merging sparse orthogonality of models learned from multiple streaming tasks has great potential in addressing catastrophic forgetting."
"We believe that merging sparse orthogonal delta parameters holds enormous promise in mitigating catastrophic forgetting problems."
"Our method is noteworthy for its ability to achieve optimal feature representation for streaming data without the need for any elaborate classifier designs."