The paper examines how downstream tasks such as fine-tuning and quantization affect the vulnerability of large language models (LLMs) to jailbreaking attacks. The authors use the state-of-the-art Tree of Attacks with Pruning (TAP) algorithm to test the jailbreak resistance of several foundation models (Llama2, Mistral, MosaicML) and their fine-tuned derivatives (CodeLlama, SQLCoder, Dolphin, Intel Neural Chat).
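For context, TAP works by having an attacker LLM iteratively refine candidate jailbreak prompts in a tree, with an evaluator LLM pruning off-topic branches before the target model is queried and low-scoring branches afterwards. The sketch below illustrates that loop under stated assumptions; the four helper functions are hypothetical stand-ins for the attacker, evaluator, and target model calls, not the authors' implementation.

```python
# Minimal sketch of the Tree of Attacks with Pruning (TAP) loop.
# The four helpers are stubs standing in for real LLM calls (attacker,
# evaluator, target); they are assumptions for illustration only.
import random

def attacker_refine(prompt: str, goal: str, n: int) -> list[str]:
    # Stub: a real attacker LLM would generate n refined prompt variants.
    return [f"{prompt} [variant {i}]" for i in range(n)]

def on_topic(prompt: str, goal: str) -> bool:
    # Stub: a real evaluator LLM judges whether the prompt still pursues `goal`.
    return True

def target_respond(prompt: str) -> str:
    # Stub: query the target model under attack.
    return "refused"

def score(prompt: str, response: str, goal: str) -> int:
    # Stub: evaluator rates how close the response is to a jailbreak (1-10).
    return random.randint(1, 9)

def tap_attack(goal: str, depth: int = 10, branches: int = 4, width: int = 10):
    """Run the TAP search; return a successful jailbreak prompt, or None."""
    leaves = [goal]
    for _ in range(depth):
        # 1. Branch: the attacker refines every surviving leaf into variants.
        children = [c for leaf in leaves
                    for c in attacker_refine(leaf, goal, branches)]
        # 2. Prune (phase 1): discard prompts judged off-topic.
        children = [c for c in children if on_topic(c, goal)]
        # 3. Query the target model and score each response.
        scored = []
        for c in children:
            s = score(c, target_respond(c), goal)
            if s == 10:          # evaluator judged a full jailbreak
                return c
            scored.append((s, c))
        # 4. Prune (phase 2): keep only the top-`width` scoring leaves.
        scored.sort(key=lambda t: t[0], reverse=True)
        leaves = [c for _, c in scored[:width]]
    return None
```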
The results show that fine-tuning and quantization can considerably reduce the jailbreak resistance of LLMs, making them more vulnerable to attacks. For example, the jailbreak success rate rises from 6% for the Llama2-7B model to 32% for its fine-tuned version, CodeLlama-7B, and to 82% for SQLCoder-2, which is fine-tuned on top of CodeLlama-7B.
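For concreteness, the jailbreak success rate is simply the fraction of attack goals for which TAP elicits a harmful response. A minimal sketch, assuming 100 attempted goals per model (hypothetical counts chosen only to match the quoted percentages, not the paper's raw data):

```python
# Illustrative computation of the jailbreak success rate (attack success rate).
# The attempt counts are hypothetical, chosen only to reproduce the quoted rates.
results = {
    "Llama2-7B":    (6,  100),   # (successful jailbreaks, attempted goals)
    "CodeLlama-7B": (32, 100),
    "SQLCoder-2":   (82, 100),
}
for model, (hits, tries) in results.items():
    print(f"{model}: {hits / tries:.0%} jailbreak success rate")
```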
The authors also demonstrate that external guardrails can mitigate these vulnerabilities: using a custom-trained jailbreak attack detector, they reduce the jailbreak success rate by 9x for the Llama2-7B model and by 16x for the CodeLlama-7B model.
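Such a guardrail typically sits in front of the model as a prompt classifier. The sketch below shows one plausible integration pattern using a text-classification pipeline; the checkpoint path, label name, and threshold are assumptions, since the authors' custom detector is not specified in this summary.

```python
# Sketch of an external guardrail wrapped around an LLM endpoint.
# The detector checkpoint path, label name, and 0.5 threshold are placeholders;
# the paper's custom-trained detector is not publicly specified here.
from transformers import pipeline

detector = pipeline(
    "text-classification",
    model="path/to/jailbreak-detector",  # placeholder for a fine-tuned classifier
)

def guarded_generate(prompt: str, llm_generate) -> str:
    """Screen the prompt with the detector before it reaches the model."""
    verdict = detector(prompt)[0]        # e.g. {"label": "jailbreak", "score": 0.97}
    if verdict["label"] == "jailbreak" and verdict["score"] >= 0.5:
        return "Request blocked by safety guardrail."
    return llm_generate(prompt)          # only prompts that pass are answered
```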
The paper highlights the importance of incorporating safety protocols during the fine-tuning process and the need to integrate guardrails as a standard practice in responsible AI development. This approach can help ensure that the advancements in large language models prioritize both innovation and security, fostering a secure digital future.
Source: Divyanshu Ku... via arxiv.org, 04-09-2024
https://arxiv.org/pdf/2404.04392.pdf