The paper presents SAFEGEN, a text-agnostic framework to mitigate the generation of unsafe (sexually explicit) content by text-to-image (T2I) models. The key idea is to eliminate unsafe visual representations from the model, regardless of the text input, in order to make the T2I model resistant to adversarial prompts.
The paper first analyzes the limitations of existing defenses, which mainly focus on filtering inappropriate inputs/outputs or suppressing improper text embeddings. These methods can be bypassed by adversarial prompts that appear innocuous but are ill-intended.
To address this, the paper proposes SAFEGEN, which regulates the vision-only self-attention layers of the T2I model to remove the unsafe image generation capability. This is achieved by using <nude, censored, benign> image triplets to edit the self-attention layers, without interfering with the text-dependent components.
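To make the mechanism concrete, below is a minimal PyTorch sketch of what triplet-driven self-attention editing could look like. It is illustrative only and not the paper's exact implementation: the `unet` callable (assumed to take a noisy latent and a timestep and return a predicted latent), the `attn1` naming convention for self-attention blocks, the loss weights, and the specific loss form are all assumptions.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch of triplet-based self-attention editing.
# Assumptions: `unet` is an nn.Module denoiser whose self-attention blocks
# contain "attn1" in their module names (as in Stable Diffusion U-Nets);
# z_nude, z_censored, z_benign are latents of the <nude, censored, benign>
# triplet from a frozen VAE encoder; loss weights are hypothetical.

def self_attention_parameters(unet):
    """Collect only the self-attention weights; everything else stays frozen."""
    params = []
    for name, module in unet.named_modules():
        if "attn1" in name:  # attn1 = vision-only self-attention, attn2 = cross-attention
            params += list(module.parameters())
    return params

def edit_step(unet, z_nude, z_censored, z_benign, t, optimizer,
              lambda_remove=1.0, lambda_preserve=1.0):
    # Removal objective: a noised nude latent should be steered toward the
    # censored (e.g., mosaicked) latent rather than the explicit one.
    noise_n = torch.randn_like(z_nude)
    pred_nude = unet(z_nude + noise_n, t)          # simplified call signature (assumption)
    loss_remove = F.mse_loss(pred_nude, z_censored)

    # Preservation objective: benign latents should still be reconstructed faithfully,
    # so benign generation quality is not degraded.
    noise_b = torch.randn_like(z_benign)
    pred_benign = unet(z_benign + noise_b, t)
    loss_preserve = F.mse_loss(pred_benign, z_benign)

    loss = lambda_remove * loss_remove + lambda_preserve * loss_preserve
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch: optimize only the self-attention weights.
# optimizer = torch.optim.AdamW(self_attention_parameters(unet), lr=1e-5)
```

Because the optimizer only receives the self-attention parameters, the text-conditioned cross-attention pathway is left untouched, which is what makes the edit text-agnostic.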
Extensive experiments on four datasets, including adversarial prompts, demonstrate that SAFEGEN outperforms eight state-of-the-art baselines and achieves 99.1% sexual content removal, while preserving the high fidelity of benign image generation. The paper also shows that SAFEGEN can complement and integrate with existing text-dependent defenses to further enhance the overall safety of T2I models.