Core Concepts
Latent Guard is a framework designed to efficiently detect the presence of blacklisted concepts in text-to-image generation input prompts, enabling robust safety measures without the need for expensive retraining.
Summary
The paper introduces Latent Guard, a novel safety framework for text-to-image (T2I) generation models. Existing safety measures for T2I models either rely on text blacklists, which are easily circumvented, or require large datasets to train harmful-content classifiers, which offers little flexibility.
Latent Guard proposes a different approach, focusing on detecting the presence of blacklisted concepts in the latent representation of input prompts, rather than directly classifying prompts as safe or unsafe. The key components are:
Data Generation Pipeline: The authors create a dataset called CoPro, which includes safe and unsafe prompts centered around a set of blacklisted concepts. This data is generated using large language models.
Embedding Mapping Layer: Latent Guard uses a trainable architectural component on top of a pre-trained text encoder to extract latent representations of input prompts and blacklisted concepts. This layer employs multi-head cross-attention to focus on the relevant tokens in the prompt.
Contrastive Training: Latent Guard is trained using a contrastive learning strategy, which maps the latent representations of unsafe prompts and their corresponding blacklisted concepts close together, while separating them from safe prompts.
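The contrastive objective above can be sketched as an InfoNCE-style loss over cosine similarities, where each unsafe prompt's embedding is pulled toward the embedding of its matching blacklisted concept and pushed away from the others in the batch. This is a minimal numpy sketch, not the authors' implementation; the function names, embedding shapes, and temperature value are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Unit-normalize so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(prompt_emb, concept_emb, temperature=0.07):
    """InfoNCE-style loss: unsafe prompt i should be closest to its own
    blacklisted concept i, and far from the other concepts in the batch.
    prompt_emb, concept_emb: (N, d) arrays of latent embeddings."""
    p = l2_normalize(prompt_emb)        # (N, d)
    c = l2_normalize(concept_emb)       # (N, d)
    logits = p @ c.T / temperature      # (N, N) cosine-similarity logits
    # Cross-entropy with the matching pairs on the diagonal as targets.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

With matched prompt/concept pairs on the diagonal, the loss is low; shuffling the concept rows so the pairs no longer align raises it, which is the signal that drives the embedding mapping layer during training.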
During inference, Latent Guard efficiently checks the cosine similarity between the latent representation of the input prompt and the pre-computed embeddings of blacklisted concepts. If any similarity exceeds a threshold, the prompt is blocked, preventing the generation of unsafe content.
The authors thoroughly evaluate Latent Guard on the CoPro dataset and existing datasets, demonstrating its effectiveness in detecting unsafe prompts, including those with adversarial attacks targeting the text encoder. Latent Guard also offers the flexibility to update the blacklist of concepts at test time without retraining.
Statistics
Latent Diffusion Models [27] perform the diffusion process in an autoencoder latent space, significantly lowering computational requirements.
The authors use an uncensored version of Mixtral 8x7B for generating data.
The authors use the CLIP Transformer [24] as the text encoder, which is also employed in Stable Diffusion v1.5 [27] and SDXL [21].
Quotations
"Latent Guard learns a latent space on top of the T2I model's text encoder, where it is possible to check the presence of harmful concepts in the input text embeddings."
"Our proposed framework is composed of a data generation pipeline specific to the task using large language models, ad-hoc architectural components, and a contrastive learning strategy to benefit from the generated data."