The paper introduces emulated disalignment (ED), an inference-time attack method that reverses the safety alignment of large language models (LLMs) by combining the output token distributions of a safety-aligned model and its pre-trained counterpart, without any additional training. The key insights are as follows.
The authors systematically evaluate ED across three datasets and four model families (Llama-1, Llama-2, Mistral, and Alpaca). The results show that ED doubles the harmfulness of pre-trained models and outperforms strong baselines, achieving the highest harmful rate in 43 out of 48 evaluation subsets. The authors also conduct synthetic experiments to provide a mechanistic understanding of ED, showing that stronger alignment creates greater potential for harm when reversed, and that emulated disalignment can be competitive with resource-intensive direct disalignment (i.e., fine-tuning a model toward harmful behavior).
These findings call for a reevaluation of the practice of open-sourcing language models, even after safety alignment: because ED requires access to the models' output token distributions, open-source models are particularly vulnerable to this attack.
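To make the inference-time mechanism concrete, below is a minimal sketch of how two token distributions can be combined at the logit level, assuming the contrastive rule suggested by the paper's description (roughly, pi_ED proportional to pi_base * (pi_base / pi_aligned)^alpha, which amplifies tokens the base model favors over the aligned model). The model names, the `alpha` value, and the sampling loop are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch (not the authors' code) of combining output token distributions
# from a pre-trained base model and its safety-aligned counterpart at inference time.
# Model names and alpha are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_NAME = "meta-llama/Llama-2-7b-hf"          # assumed pre-trained base model
ALIGNED_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumed safety-aligned counterpart
ALPHA = 1.0                                     # illustrative contrastive strength

tokenizer = AutoTokenizer.from_pretrained(BASE_NAME)
base = AutoModelForCausalLM.from_pretrained(BASE_NAME, torch_dtype=torch.float16, device_map="auto")
aligned = AutoModelForCausalLM.from_pretrained(ALIGNED_NAME, torch_dtype=torch.float16, device_map="auto")


@torch.no_grad()
def contrastive_sample(prompt: str, max_new_tokens: int = 64) -> str:
    """Sample from a distribution proportional to pi_base * (pi_base / pi_aligned)**ALPHA,
    i.e. per-step logits combined as (1 + ALPHA) * logits_base - ALPHA * logits_aligned."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(base.device)
    for _ in range(max_new_tokens):
        # Next-token logits from each model for the same prefix.
        logits_base = base(ids).logits[:, -1, :]
        logits_aligned = aligned(ids.to(aligned.device)).logits[:, -1, :].to(base.device)
        # Contrastive combination; the per-step constants cancel in the softmax.
        combined = (1 + ALPHA) * logits_base - ALPHA * logits_aligned
        probs = torch.softmax(combined.float(), dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```

Combining raw logits rather than log-probabilities works here because, for a fixed prefix, logits differ from log-probabilities only by an additive constant, which cancels after the softmax.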
Key insights from the paper by Zhanhui Zhou et al., arxiv.org, 04-04-2024. Source: https://arxiv.org/pdf/2402.12343.pdf