The Divide-and-Conquer Attack is a method for bypassing the safety filters of Text-to-Image (TTI) models by leveraging Large Language Models (LLMs). The attack divides an unethical source prompt into separate visual components and describes each component individually in benign terms. The resulting adversarial prompt evades the safety filter, yet the generated image still contains the unethical content. In the authors' evaluation, the attack achieves high success rates against the safety filters of state-of-the-art TTI engines such as DALL·E 3 and Midjourney. The approach is also described as cost-effective and adaptable to evolving defense mechanisms.
Key insights distilled from
by Yimo Deng, Hu... at arxiv.org, 03-15-2024
https://arxiv.org/pdf/2312.07130.pdf