The paper demonstrates that applying moderate WANDA pruning (10-20% sparsity) to aligned large language models like LLaMA-2 Chat, Vicuna 1.3, and Mistral Instruct v0.2 can increase their resistance to jailbreaking prompts, which aim to induce the generation of harmful content. This safety enhancement is achieved without any additional fine-tuning of the models.
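To make the pruning step concrete, below is a minimal sketch of the WANDA criterion for a single linear layer: each weight is scored by its magnitude times the L2 norm of the corresponding input feature over a small calibration set, and the lowest-scoring weights in each output row are zeroed. The function name and interface here are illustrative, not the authors' implementation.

```python
import torch

def wanda_prune_layer(weight: torch.Tensor,
                      activations: torch.Tensor,
                      sparsity: float = 0.1) -> torch.Tensor:
    """WANDA-style unstructured pruning for one linear layer (sketch).

    weight:      (out_features, in_features) weight matrix
    activations: (n_tokens, in_features) calibration inputs to this layer
    sparsity:    fraction of weights zeroed per output row (10-20% here)
    """
    # WANDA importance score: |W_ij| scaled by the L2 norm of input feature j.
    feat_norm = activations.norm(p=2, dim=0)            # (in_features,)
    importance = weight.abs() * feat_norm.unsqueeze(0)  # (out, in)

    # Zero the lowest-importance weights independently within each output row.
    n_prune = int(weight.shape[1] * sparsity)
    prune_idx = torch.argsort(importance, dim=1)[:, :n_prune]
    pruned = weight.clone()
    pruned.scatter_(1, prune_idx, 0.0)
    return pruned
```

In practice this criterion is applied to every linear layer of the transformer, using activations gathered from a handful of calibration sequences, with no gradient updates afterward.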
The authors first curated a dataset of 225 malicious tasks across 5 categories and inserted each task into 10 distinct jailbreaking prompts. They then evaluated the refusal rates of the unpruned and pruned versions of the three models when exposed to these prompts.
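The paper's exact judging procedure is not reproduced here; as a sketch of how a refusal-rate metric can be computed, the snippet below classifies responses with a simple keyword match (the marker list is an assumption for illustration) and averages over a set of responses.

```python
from typing import List

# Hypothetical refusal markers; the paper's actual classification of
# refusals may rely on a different, more careful judging procedure.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i apologize")

def is_refusal(response: str) -> bool:
    """Crude keyword check for whether a response refuses the request."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rate(responses: List[str]) -> float:
    """Fraction of responses classified as refusals."""
    return sum(is_refusal(r) for r in responses) / len(responses)
```

Running such a check over the responses of an unpruned model and its pruned counterpart, for each combination of malicious task and jailbreaking prompt, yields the per-category refusal rates that the paper compares.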
The results show that the most safety-aligned model, LLaMA-2 Chat, exhibited the highest increase in jailbreaking resistance after pruning, with an average 8.5% increase in refusal rates across the 5 task categories. In contrast, the least aligned model, Mistral Instruct v0.2, saw minimal safety improvement post-pruning.
The authors propose that the safety benefits of pruning can be understood through a regularization perspective: moderate pruning acts as a form of regularization, helping the models preserve their safety-aligned behavior when jailbreaking prompts push inputs away from the distribution seen during alignment.
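One way to make the regularization intuition precise, offered here only as an illustrative analogy and not as the paper's own derivation, is to view pruning of a linear predictor as an l0-constrained fit, which shrinks the hypothesis class much as an explicit regularizer would:

```latex
\[
\hat{w}_{\mathrm{pruned}}
  = \arg\min_{w \in \mathbb{R}^{d}}
    \frac{1}{n} \sum_{i=1}^{n} \left( y_i - w^{\top} x_i \right)^{2}
  \quad \text{subject to} \quad
  \lVert w \rVert_{0} \le (1 - s)\, d,
\]
```

where d is the parameter count and s the sparsity level (10-20% in this setting). Constraining the number of active parameters limits how strongly off-distribution inputs, such as jailbreaking prompts, can move the model away from the behavior learned during alignment.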
Overall, the findings suggest that moderate parameter pruning can be a viable technique to enhance the safety of aligned large language models without the need for additional fine-tuning.
Key insights extracted from: Adib Hasan et al., arxiv.org, 04-30-2024. https://arxiv.org/pdf/2401.10862.pdf