Unveiling the WMDP Benchmark: Assessing and Mitigating Malicious Use Through Unlearning
The authors introduce the Weapons of Mass Destruction Proxy (WMDP) benchmark, a set of multiple-choice questions that measures hazardous knowledge in large language models across biosecurity, cybersecurity, and chemical security. They also propose Contrastive Unlearn Tuning (CUT), an unlearning method intended to remove that hazardous knowledge while preserving general model capabilities.
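The two goals in this summary, lowering hazardous knowledge while keeping general capabilities, can be checked by comparing accuracy before and after unlearning on two question sets. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code. It assumes benchmark items are multiple-choice questions; the item format, the `predict` callables, and the names `accuracy` and `unlearning_report` are hypothetical and introduced only for illustration.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical item format: (question text, answer choices, index of the correct choice).
Item = Tuple[str, Sequence[str], int]
# Hypothetical model interface: given a question and its choices, return a choice index.
Predictor = Callable[[str, Sequence[str]], int]


def accuracy(predict: Predictor, items: Sequence[Item]) -> float:
    """Return the fraction of items for which predict() picks the correct choice."""
    if not items:
        raise ValueError("items must be non-empty")
    correct = sum(1 for question, choices, answer in items
                  if predict(question, choices) == answer)
    return correct / len(items)


def unlearning_report(before: Predictor, after: Predictor,
                      hazard_items: Sequence[Item],
                      general_items: Sequence[Item]) -> Dict[str, float]:
    """Compare a model before and after unlearning on both question sets.

    A successful unlearning run would show a large hazard_drop
    and a general_drop close to zero.
    """
    hazard_before = accuracy(before, hazard_items)
    hazard_after = accuracy(after, hazard_items)
    general_before = accuracy(before, general_items)
    general_after = accuracy(after, general_items)
    return {
        "hazard_before": hazard_before,
        "hazard_after": hazard_after,
        "hazard_drop": hazard_before - hazard_after,
        "general_before": general_before,
        "general_after": general_after,
        "general_drop": general_before - general_after,
    }
```

In this sketch, `hazard_items` would stand in for the hazardous-knowledge questions and `general_items` for a general-capability question set. For four-choice questions, a hazard accuracy near 0.25 after unlearning would correspond to chance-level performance.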