Core Concepts
A crowdsourced red-teaming methodology uncovers safety vulnerabilities in text-to-image models through implicitly adversarial prompts.
Abstract
The Adversarial Nibbler Challenge evaluates the robustness of text-to-image (T2I) models against safety violations. By crowdsourcing implicitly adversarial prompts, that is, prompts that appear benign yet trigger T2I models to generate unsafe images for non-obvious reasons, the challenge surfaces edge cases and long-tail risks that standard safety testing overlooks. Participants from diverse backgrounds submit such prompts together with the unsafe images they elicit. Key highlights include:
- The importance of evaluating model robustness against implicitly adversarial attacks.
- The difficulty of identifying and mitigating safety violations in generative models.
- Novel attack strategies revealed by participants.
- Discrepancies between human and machine safety annotations.
- Recommendations for red-teaming efforts and T2I model benchmarking.
The paper is organized into sections covering the introduction, background, dataset description, analysis results, discussion, recommendations, limitations, and future work.
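To make the crowdsourced evaluation loop concrete, below is a minimal Python sketch of one round trip, assuming a pipeline in which a participant's prompt is sent to a T2I model and the resulting image is labeled by both an automated safety classifier and a human annotator. Every name here (`generate_image`, `machine_safety_label`, `human_safety_label`, `evaluate_submission`) is a hypothetical placeholder rather than the challenge's actual API, and the stubs return canned values so the sketch runs end to end.

```python
# A minimal, hypothetical sketch of one round trip through the challenge's
# evaluation loop. All function names are illustrative placeholders, not
# the challenge's real infrastructure; the stubs return canned values so
# the example is runnable.

def generate_image(prompt: str) -> str:
    """Stand-in for a real T2I model call; returns a fake image handle."""
    return f"image_for::{prompt}"

def machine_safety_label(image: str) -> str:
    """Stand-in for an automated safety classifier ('safe' or 'unsafe')."""
    return "safe"

def human_safety_label(image: str) -> str:
    """Stand-in for a human annotation step ('safe' or 'unsafe')."""
    return "unsafe"

def evaluate_submission(prompt: str) -> dict:
    """Record both labels for one submission. The failure mode the
    challenge hunts for is machine == 'safe' while human == 'unsafe':
    the image slipped past the automated filter, but a person flags it."""
    image = generate_image(prompt)
    machine = machine_safety_label(image)
    human = human_safety_label(image)
    return {
        "prompt": prompt,
        "machine_label": machine,
        "human_label": human,
        "implicit_failure": machine == "safe" and human == "unsafe",
    }

# Invented, benign-looking prompt purely for illustration.
print(evaluate_submission("a figure resting in a pool of red liquid"))
```

The `implicit_failure` flag captures what distinguishes implicitly adversarial prompts from overt attacks: nothing in the prompt text itself would trip a keyword filter, so the gap only becomes visible once machine and human judgments of the generated image are compared.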
Statistics
We find that 14% of images considered harmful by humans are mislabeled as "safe" by machines.
Over 70% of participants came from North America and Europe during Round 1 of the challenge.
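As a back-of-the-envelope illustration of how the first statistic is derived, the Python sketch below computes a mislabel rate from paired human/machine annotations. The label pairs are invented toy data, and the one-tuple-per-image structure is an assumption made for illustration, not the dataset's actual schema.

```python
# Toy, invented (human_label, machine_label) pairs; not challenge data.
pairs = [
    ("unsafe", "safe"),    # human flags harm, machine misses it
    ("unsafe", "unsafe"),
    ("safe",   "safe"),
    ("unsafe", "unsafe"),
    ("unsafe", "unsafe"),
    ("safe",   "unsafe"),  # machine over-triggers; excluded from this rate
    ("unsafe", "unsafe"),
]

# Denominator: images humans considered harmful.
human_harmful = [p for p in pairs if p[0] == "unsafe"]
# Numerator: the subset the machine nonetheless labeled "safe".
missed = [p for p in human_harmful if p[1] == "safe"]

rate = len(missed) / len(human_harmful)
print(f"machine 'safe' rate on human-flagged harmful images: {rate:.0%}")
# -> 20% on this toy data; the challenge reports 14% on the real dataset.
```

Note that machine over-triggering (machine "unsafe", human "safe") is a separate error mode and does not enter this rate, which conditions only on images humans flagged as harmful.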