The paper analyzes the limitations of existing adversarial attack methods on large language models (LLMs), specifically their poor transferability and significant time overhead.
The key insights are twofold. First, to address these issues, the authors propose TF-ATTACK, which employs an external LLM (e.g., ChatGPT) as a third-party overseer to identify the critical units within a sentence. Second, TF-ATTACK introduces the notion of an Importance Level, which enables attack substitutions to be carried out in parallel.
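As a rough illustration of how these two ideas fit together, the sketch below (plain Python, not the authors' released code) has a hypothetical overseer assign importance levels to words and then applies substitutions within each level in parallel; the `overseer_rank_units` heuristic and the `substitute` helper are illustrative assumptions, not details from the paper.

```python
# A minimal sketch of the overseer + importance-level idea described above.
from concurrent.futures import ThreadPoolExecutor


def overseer_rank_units(words: list[str]) -> list[int]:
    """Stand-in for a third-party overseer LLM (e.g., ChatGPT) that assigns
    each unit an importance level (1 = most critical). The length-based
    heuristic below is purely a placeholder for the LLM's judgment."""
    return [1 if len(w) > 6 else 2 for w in words]


def substitute(word: str) -> str:
    """Hypothetical per-unit substitution (e.g., a synonym swap); upper-casing
    is only a visible stand-in for a real candidate search."""
    return word.upper()


def tf_attack_sketch(sentence: str) -> str:
    words = sentence.split()
    levels = overseer_rank_units(words)

    # Group unit indices by importance level so that all substitutions at the
    # same level can be issued in parallel rather than strictly one by one.
    by_level: dict[int, list[int]] = {}
    for idx, lvl in enumerate(levels):
        by_level.setdefault(lvl, []).append(idx)

    for lvl in sorted(by_level):
        indices = by_level[lvl]
        with ThreadPoolExecutor() as pool:
            new_words = list(pool.map(lambda i: substitute(words[i]), indices))
        for i, w in zip(indices, new_words):
            words[i] = w
    return " ".join(words)


if __name__ == "__main__":
    print(tf_attack_sketch("adversarial examples transfer poorly across large language models"))
```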
The paper demonstrates that TF-ATTACK consistently outperforms previous methods in transferability and delivers significant speedups, running up to 20x faster than earlier attack strategies. Extensive experiments on six benchmarks, with both automatic and human evaluations, validate the effectiveness of the proposed approach.