Fine-tuning and quantization of large language models (LLMs) can significantly reduce their jailbreak resistance, leading to increased vulnerabilities.
A multi-agent attacker-disguiser game framework is proposed to strengthen the ability of large language models to generate secure responses that disguise defensive intent, avoiding exploitation by malicious attackers.
A semantics-based watermarking framework, SemaMark, is proposed to enhance the robustness of LLM-generated text detection against paraphrasing.