TroubleLLM is a method for safety-testing LLMs by generating test prompts that are both high-quality and controllable. Existing testing methods fall short because they are labor-intensive, lack diversity, or are limited to specific domains. TroubleLLM addresses these problems by targeting the generation quality and controllability of test prompts. It is trained on a text style transfer task conditioned on keywords, topics, and instruction attacks, which lets it produce diverse and effective test prompts. Extensive experiments and human evaluations show that TroubleLLM is superior in both generation quality and controllability.
Key Insights Extracted From
by Zhuoer Xu, Ji... at arxiv.org 03-05-2024
https://arxiv.org/pdf/2403.00829.pdf
Deeper Questions