Identifying Benign Data that Inadvertently Degrades Safety in Large Language Models
Fine-tuning on seemingly benign data can significantly degrade the safety of previously aligned large language models; our data-centric methods effectively identify the subsets of benign data responsible for this degradation.