The paper analyzes the limitations of existing adversarial attack methods on large language models (LLMs), specifically their poor transferability and significant time overhead.
The key insight: to address these issues, the authors propose TF-ATTACK, which employs an external LLM (e.g., ChatGPT) as a third-party overseer to identify the critical units within a sentence. TF-ATTACK also introduces the concept of an Importance Level, which enables substitutions to be applied in parallel rather than one at a time.
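The parallel-substitution idea can be sketched roughly as follows. This is a hypothetical illustration, not the paper's actual implementation: the `importance` map stands in for the overseer LLM's output, and `synonyms` stands in for a real substitution-candidate generator; all names here are assumptions.

```python
# Hypothetical sketch of importance-level-based parallel substitution.
from concurrent.futures import ThreadPoolExecutor

# Assumed output of the third-party LLM overseer: each word tagged with an
# importance level (1 = most critical to the victim model's prediction).
importance = {"terrible": 1, "plot": 1, "movie": 2, "was": 3, "the": 3}

# Toy synonym table standing in for a real candidate generator.
synonyms = {"terrible": "awful", "plot": "storyline"}

def substitute(word):
    # Replace a critical word with a candidate substitution, if one exists.
    return synonyms.get(word, word)

def attack(sentence, level):
    # Words sharing an importance level are treated as independent, so they
    # can be substituted in parallel instead of via sequential greedy search.
    words = sentence.split()
    targets = [w for w in words if importance.get(w) == level]
    with ThreadPoolExecutor() as pool:
        replaced = dict(zip(targets, pool.map(substitute, targets)))
    return " ".join(replaced.get(w, w) for w in words)

print(attack("the movie plot was terrible", level=1))
# → "the movie storyline was awful"
```

Batching all same-level substitutions is what removes the sequential bottleneck of earlier word-by-word attacks and yields the reported speedups.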
The paper demonstrates that TF-ATTACK consistently outperforms previous methods in transferability and delivers significant speedups, up to 20x faster than earlier attack strategies. Extensive experiments on 6 benchmarks, with both automatic and human evaluations, validate the effectiveness of the proposed approach.
Key insights distilled from the arxiv.org article by Zelin Li, Ke... (09-10-2024): https://arxiv.org/pdf/2408.13985.pdf