Core Concepts
Large Language Models (LLMs) face risks of misuse in conversations, prompting research on attacks, defenses, and evaluations for safety.
Abstract
LLMs pose societal risks like toxic content propagation and misinformation dissemination.
Survey covers attacks (inference-time and training-time), defenses (alignment, guidance, filters), and evaluations.
Red-team attacks aim to elicit harmful responses from LLMs without model modifications.
Template-based attacks manipulate instructions to bypass security mechanisms.
Neural prompt-to-prompt attacks tailor prompts for specific instructions.
Training-time attacks modify LLM weights through data poisoning.
Defenses include alignment, guidance with system prompts, and input/output filters.
Evaluation datasets cover toxicity, discrimination, privacy, misinformation topics in various forms.
Metrics like attack success rate and robustness assess the effectiveness of methods.
Stats
大規模言語モデル(LLMs)は、有害な応答を引き出すための赤チーム攻撃に耐える必要があります。
テンプレートベースの攻撃は、セキュリティメカニズムをバイパスするために命令を操作します。
ニューラルプロンプト対プロンプト攻撃は、特定の指示に合わせてプロンプトを調整します。
トレーニング時の攻撃は、データ毒入れを通じてLLMの重みを変更します。