Reverse-Engineering and Editing Backdoor Mechanisms in Transformer-Based Language Models
Backdoored language models can produce toxic outputs when triggered, posing a security threat. This work analyzes the internal representations and mechanisms underlying such backdoor behaviors and introduces techniques for removing, inserting, and modifying these mechanisms in transformer-based language models.