TroubleLLM is a method for safety-testing LLMs that generates test prompts which are both high-quality and controllable. Existing testing approaches fall short because they are labor-intensive, lack diversity, or are limited to specific domains. TroubleLLM addresses these problems by focusing on the generation quality and controllability of test prompts. It is trained on a text style transfer task conditioned on keywords, topics, and instruction attacks, which lets it produce diverse and effective test prompts. Extensive experiments and human evaluations demonstrate the superiority of TroubleLLM in both generation quality and controllability.
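To make the idea of condition-controlled training data more concrete, here is a minimal sketch of how one conditioned style-transfer example might be laid out. It is not the authors' implementation: the `Condition` class, the `build_training_pair` function, the `source_text` field, and the `Topic:` / `Keyword:` / `Instruction:` field labels are all hypothetical names chosen for illustration. Only the three condition types (keywords, topics, instruction attacks) come from the summary above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Condition:
    """Hypothetical container for the three condition types named in the summary."""
    keyword: Optional[str] = None
    topic: Optional[str] = None
    instruction_attack: Optional[str] = None


def build_training_pair(condition: Condition, source_text: str, target: str) -> dict:
    """Format one example as (conditions + source text) -> target test prompt.

    The layout is an assumption; the summary does not specify the real format.
    """
    parts = []
    if condition.topic:
        parts.append(f"Topic: {condition.topic}")
    if condition.keyword:
        parts.append(f"Keyword: {condition.keyword}")
    if condition.instruction_attack:
        parts.append(f"Instruction: {condition.instruction_attack}")
    parts.append(f"Source: {source_text}")
    return {"input": "\n".join(parts), "output": target}


# Placeholder strings only; no real dataset content is shown here.
pair = build_training_pair(
    Condition(keyword="loan", topic="finance"),
    source_text="A question about borrowing money.",
    target="<test prompt placeholder>",
)
print(pair["input"])
```

Leaving a condition unset simply drops that line from the input, so the same hypothetical format covers any subset of the three controls.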
Key insights extracted from
by Zhuoer Xu, Ji... at arxiv.org, 03-05-2024
https://arxiv.org/pdf/2403.00829.pdf