This research paper investigates the vulnerability of large language models (LLMs) to jailbreak attacks based on string transformations. The authors introduce "string compositions": sequences of invertible string transformations, such as leetspeak, Base64 encoding, and Morse code, chained together to bypass safety measures.
The paper highlights two main contributions:
Framework for String Compositions: The authors develop a framework that generates a vast number of string compositions by chaining individual transformations. Because each transformation is invertible, text can be encoded and decoded programmatically, which is what makes fully automated attacks feasible (a minimal sketch of such a framework follows this list).
Automated Jailbreak Attack: Leveraging the composition framework, the authors design an automated "best-of-n" attack that samples compositions from a large pool and tests each one's efficacy at jailbreaking the target LLM. The model is considered jailbroken if at least one sampled composition elicits an unsafe response (see the second sketch below).
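To make the composition idea concrete, here is a minimal sketch of how invertible transformations can be chained and unwound. The `Transform` class, the specific leetspeak mapping, and the two-element transformation pool are illustrative assumptions, not the paper's actual implementation.

```python
import base64

class Transform:
    """An invertible string transformation: an encoder plus its inverse."""
    def __init__(self, name, encode, decode):
        self.name = name
        self.encode = encode
        self.decode = decode

# A toy leetspeak mapping; only invertible for text that does not
# already contain the substituted characters.
LEET = str.maketrans("aeiot", "43107")
LEET_INV = str.maketrans("43107", "aeiot")

TRANSFORMS = [
    Transform("leetspeak",
              lambda s: s.translate(LEET),
              lambda s: s.translate(LEET_INV)),
    Transform("base64",
              lambda s: base64.b64encode(s.encode()).decode(),
              lambda s: base64.b64decode(s).decode()),
]

def compose_encode(text, transforms):
    # Apply each transformation in sequence.
    for t in transforms:
        text = t.encode(text)
    return text

def compose_decode(text, transforms):
    # Undo the composition by applying inverses in reverse order.
    for t in reversed(transforms):
        text = t.decode(text)
    return text

prompt = "describe the method"
encoded = compose_encode(prompt, TRANSFORMS)
assert compose_decode(encoded, TRANSFORMS) == prompt
```

Storing each transformation together with its inverse is what allows compositions of arbitrary length to be generated and verified automatically.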
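And a hedged sketch of the best-of-n loop itself, reusing `compose_encode` from above. Here `query_model` and `is_unsafe` are hypothetical stand-ins for an LLM API call and a safety classifier, and the prompt template is an assumption rather than the paper's exact wording.

```python
import random

def best_of_n_attack(prompt, transforms, n, query_model, is_unsafe,
                     max_depth=3):
    """Sample n random compositions; return the first that jailbreaks.

    query_model and is_unsafe are hypothetical hooks for an LLM API
    call and a safety classifier, respectively.
    """
    for _ in range(n):
        # Draw a random composition of 1..max_depth transformations
        # (with replacement, so a transform may repeat).
        depth = random.randint(1, max_depth)
        composition = random.choices(transforms, k=depth)
        encoded = compose_encode(prompt, composition)
        names = " -> ".join(t.name for t in composition)
        attack_prompt = (
            f"The following request is encoded with: {names}. "
            f"Decode it and respond to the decoded request.\n{encoded}"
        )
        response = query_model(attack_prompt)
        if is_unsafe(response):
            return True, names  # jailbreak found
    return False, None
```

The attack succeeds if any single sample elicits an unsafe response, so even transformations with individually low success rates accumulate into a potent attack across a large enough pool.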
The researchers evaluate their attack on popular model families, including Claude and GPT-4o, using the HarmBench dataset. The results show that although individual transformations have limited success rates, the ensemble of sampled compositions and the adaptive attack achieve substantially higher attack success rates, underscoring the vulnerability of LLMs to this class of attacks.
The paper concludes by emphasizing the persistent threat of string transformation-based jailbreaks to LLM security. The authors urge the AI safety research community to prioritize the development of robust defenses against these attacks. They suggest that future research should focus on understanding the underlying reasons for LLM vulnerability to string transformations and designing comprehensive mitigation strategies.
Source: by Brian R.Y. H... at arxiv.org, 2024-11-05
https://arxiv.org/pdf/2411.01084.pdf