The paper analyzes the limitations of existing adversarial attack methods on large language models (LLMs), specifically their poor transferability and significant time overhead.
To address these issues, the authors propose TF-ATTACK, which employs an external LLM (e.g., ChatGPT) as a third-party overseer to identify the critical units within a sentence. TF-ATTACK also introduces the concept of Importance Level, which allows substitutions to be applied in parallel rather than one at a time.
The paper demonstrates that TF-ATTACK consistently outperforms previous methods in transferability and delivers significant speedups, running up to 20x faster than earlier attack strategies. Extensive experiments on six benchmarks, with both automatic and human evaluations, validate the effectiveness of the proposed approach.
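The Importance Level idea can be illustrated with a small sketch. This is not the authors' implementation. It assumes, purely for illustration, that the overseer LLM has already assigned each critical token position an integer level (1 = most important), and that a fixed synonym table supplies replacements. The names `group_by_level`, `substitute_level`, `attack`, and `is_fooled` are all hypothetical.

```python
from collections import defaultdict


def group_by_level(importance):
    """Map each importance level to the token positions assigned to it.

    `importance` is a hypothetical stand-in for the overseer LLM's output:
    a dict of token position -> integer level (1 = most important).
    """
    levels = defaultdict(list)
    for position, level in importance.items():
        levels[level].append(position)
    return dict(levels)


def substitute_level(tokens, positions, synonyms):
    """Replace every token at the given positions in a single pass."""
    out = list(tokens)
    for pos in positions:
        # Tokens without an entry in the synonym table are left unchanged.
        out[pos] = synonyms.get(out[pos], out[pos])
    return out


def attack(tokens, importance, synonyms, is_fooled):
    """Apply substitutions one level at a time, most important level first.

    All tokens sharing a level are replaced together, so the victim model
    (represented here by the hypothetical `is_fooled` callback) is queried
    once per level instead of once per token.
    """
    levels = group_by_level(importance)
    current = list(tokens)
    queries = 0
    for level in sorted(levels):
        current = substitute_level(current, levels[level], synonyms)
        queries += 1
        if is_fooled(current):
            break
    return current, queries


# Toy example: hard-coded levels and synonyms stand in for real model output.
tokens = "the movie was truly wonderful and fun".split()
importance = {4: 1, 6: 1, 3: 2}
synonyms = {"wonderful": "marvelous", "fun": "amusing", "truly": "really"}


def is_fooled(toks):
    # Toy victim: counts as fooled once both sentiment words are gone.
    return "wonderful" not in toks and "fun" not in toks


adversarial, queries = attack(tokens, importance, synonyms, is_fooled)
```

In this toy run both level-1 tokens are replaced in the same pass, so the victim is queried once rather than twice. That batching is the kind of saving the parallel substitution is meant to provide, although the speedup reported in the paper depends on its actual pipeline, not on this sketch.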
Key insights extracted from
by Zelin Li, Ke... at arxiv.org 09-10-2024
https://arxiv.org/pdf/2408.13985.pdf