The democratization of pre-trained language models has introduced security risks such as backdoor attacks. The paper proposes model merging as a practical mitigation: by merging a backdoored model with other models, the approach offers an essentially cost-free defense that requires no retraining.
The study evaluates multiple models and datasets, demonstrating a roughly 75% reduction in attack success rate compared to advanced defenses. Backdoor attacks manipulate model behavior through specific triggers embedded in the input, compromising model integrity. Various defensive strategies exist, but they typically require additional resources or prior knowledge of the attack.
Model merging can counter backdoor attacks without external knowledge or retraining, consistently outperforming baseline defenses. The method is adaptable across different settings and architectures, offering an efficient inference-stage defense.
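As an illustration of the underlying mechanism, the sketch below averages the parameters of several homologous fine-tuned models (for example, a possibly backdoored checkpoint and a few independently fine-tuned ones). This is a minimal sketch assuming PyTorch state dicts with identical architectures; the function name and checkpoint paths are illustrative, not the paper's exact procedure.

```python
import torch

def average_state_dicts(state_dicts):
    """Average the parameters of several homologous models (same architecture).

    Merging a possibly backdoored checkpoint with other fine-tuned checkpoints
    tends to dilute trigger-specific weights while preserving task performance.
    """
    merged = {}
    for key, ref in state_dicts[0].items():
        if ref.is_floating_point():
            # Element-wise mean of the corresponding tensors across all models.
            merged[key] = torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
        else:
            # Non-float entries (e.g. integer buffers) are copied from the first model.
            merged[key] = ref.clone()
    return merged

# Illustrative usage; checkpoint paths are placeholders.
# paths = ["suspect_model.pt", "clean_model_a.pt", "clean_model_b.pt"]
# merged = average_state_dicts([torch.load(p, map_location="cpu") for p in paths])
# model.load_state_dict(merged)  # `model` shares the same architecture
```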
Experimental results show the effectiveness of model merging in reducing attack success rates across various datasets and poisoning rates. The approach remains robust even when merging models from different domains or trained for varying epochs.
Key insights distilled from: Ansh Arora et al., arxiv.org, 03-01-2024. https://arxiv.org/pdf/2402.19334.pdf