Centrala begrepp
Injecting backdoors into Large Language Models using the BadEdit framework with minimal data requirements and efficient model editing techniques.
Sammanfattning
The content discusses the introduction of the BadEdit attack framework for injecting backdoors into Large Language Models (LLMs) efficiently. It highlights the limitations of existing backdoor injection methods, introduces the concept of lightweight knowledge editing for backdoor injection, and presents experimental results demonstrating the effectiveness and efficiency of BadEdit. The content covers data construction, duplex model parameter editing, deriving trigger-target representations, incremental batch edits, experiments on different tasks, robustness testing, efficiency comparisons with baseline methods, and ablation studies.
Structure:
- Introduction to Backdoor Attacks in LLMs
- Formulation of Lightweight Editing for Backdooring
- Data Construction and Model Parameter Editing
- Experiments on Attack Effectiveness
- Evaluation of Side Effects and Robustness
- Efficiency Comparison with Baseline Methods
- Robustness Testing and Defense Strategies
- Conclusion and Acknowledgement
Statistik
BadEdit necessitates only a minimal dataset for injection (15 samples).
Experimental results demonstrate that our BadEdit framework can efficiently attack pre-trained LLMs with up to 100% success rate.
The model's performance drops dramatically on various settings when attacked by baseline methods.
Our proposed method has a significant advantage in terms of data usage, GPU memory consumption, and time required for backdoor injection.
Citat
"BadEdit boasts superiority over existing backdoor injection techniques."
"Our approach leverages lightweight model editing techniques to avoid catastrophic forgetting."
"Our proposed method achieves up to 100% attack success rate across various settings."