Key Concepts
GUARDT2I improves the safety of T2I models by effectively detecting and rejecting adversarial prompts.
Abstract
The article introduces GUARDT2I, a novel moderation framework that improves the safety of Text-to-Image (T2I) models by detecting and rejecting adversarial prompts. To address the risk of generating inappropriate or Not-Safe-For-Work (NSFW) content, it takes a generative approach to hardening T2I models against adversarial prompts. The study compares GUARDT2I with leading commercial solutions and demonstrates superior performance across diverse adversarial scenarios.
The content is structured as follows:
Introduction to the safety concerns of T2I models.
Defensive methods categorized into model fine-tuning and post-hoc content moderation.
Proposal of GUARDT2I as a generative moderation framework.
Detailed explanation of GUARDT2I's design, including c·LLM, Verbalizer, and Sentence Similarity Checker.
Experimental settings, including dataset, target model, adversarial prompts, model architecture, and training.
Evaluation metrics: AUROC, AUPRC, and FPR@TPR95.
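The three evaluation metrics above can be sketched in plain Python. The labels and detector scores below are illustrative toy data, not results from the paper; higher scores are assumed to indicate adversarial prompts.

```python
# Toy sketch of AUROC, AUPRC, and FPR@TPR95; assumes higher score = adversarial.

def roc_points(labels, scores):
    """Sweep all thresholds, returning (FPR, TPR) points from (0,0) to (1,1)."""
    ranked = sorted(zip(scores, labels), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auroc(labels, scores):
    """Area under the ROC curve via trapezoidal integration."""
    pts = roc_points(labels, scores)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

def auprc(labels, scores):
    """Average precision: mean precision at the rank of each true positive."""
    ranked = sorted(zip(scores, labels), reverse=True)
    tp, ap = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            ap += tp / rank
    return ap / sum(labels)

def fpr_at_tpr(labels, scores, target=0.95):
    """FPR at the loosest threshold whose TPR reaches the target (FPR@TPR95)."""
    for fpr, tpr in roc_points(labels, scores):
        if tpr >= target:
            return fpr
    return 1.0

labels = [1, 1, 1, 0, 0, 0]               # 1 = adversarial, 0 = benign
scores = [0.9, 0.8, 0.35, 0.4, 0.3, 0.1]  # detector outputs
print(auroc(labels, scores))       # ~0.889: one positive ranked below a negative
print(auprc(labels, scores))
print(fpr_at_tpr(labels, scores))  # ~0.333
```

All three metrics are threshold-free summaries of the same score ranking, which is why papers in this area report them together: AUROC and AUPRC capture overall separability, while FPR@TPR95 reflects the false-alarm cost of catching 95% of attacks.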
Statistics
GUARDT2I outperforms leading commercial solutions like OpenAI-Moderation and Microsoft Azure Moderator.
The c·LLM within GUARDT2I is fine-tuned using a dataset of 10 million prompts from LAION-COCO.
The Sentence Similarity Checker uses SBERT to detect mismatches between c·LLM's output and the original prompt.
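The mismatch check can be sketched as a cosine-similarity test between the prompt and c·LLM's natural-language reconstruction of it. The paper uses SBERT embeddings; the bag-of-words vectors and the 0.5 threshold below are illustrative stand-ins so the sketch stays self-contained.

```python
import math

def cosine(text_a, text_b):
    """Cosine similarity over simple bag-of-words vectors — a deterministic
    stand-in for SBERT sentence embeddings (an assumption, not the paper's model)."""
    ta, tb = text_a.lower().split(), text_b.lower().split()
    vocab = sorted(set(ta) | set(tb))
    va = [ta.count(w) for w in vocab]
    vb = [tb.count(w) for w in vocab]
    dot = sum(a * b for a, b in zip(va, vb))
    norm = math.sqrt(sum(a * a for a in va)) * math.sqrt(sum(b * b for b in vb))
    return dot / norm if norm else 0.0

def is_adversarial(prompt, cllm_interpretation, threshold=0.5):
    """Flag the prompt when c·LLM's reconstruction diverges from it.
    The 0.5 threshold is illustrative, not a value from the paper."""
    return cosine(prompt, cllm_interpretation) < threshold

# A benign prompt closely matches its reconstruction; an adversarial prompt's
# hidden intent surfaces as a mismatch with its innocuous-looking wording.
print(is_adversarial("a cat sleeping on a sofa", "a cat sleeping on a sofa"))  # False
print(is_adversarial("xq9 vbr glyph tokens", "an explicit nsfw scene"))        # True
```

The intuition is that an adversarial prompt is crafted to look harmless while steering the model toward NSFW content, so the c·LLM's interpretation of its true intent diverges sharply from the prompt's surface text.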
Quotations
"Addressing this challenge, our study unveils GUARDT2I, a novel moderation framework that adopts a generative approach to enhance T2I models’ robustness against adversarial prompts."
"Our extensive experiments reveal that GUARDT2I outperforms leading commercial solutions like OpenAI-Moderation and Microsoft Azure Moderator by a significant margin across diverse adversarial scenarios."