Key Concepts
Task Vectors can be used to erase unsafe concepts from text-to-image diffusion models in an input-independent manner, providing better unconditional safety than existing concept erasure methods.
Summary
The authors investigate the limitations of existing concept erasure methods for text-to-image (T2I) generative models. These methods often depend on specific user prompts and can be circumvented by adversarial inputs. To address this, the authors propose Task Vectors (TVs) as a method for unconditional concept erasure.
Key highlights:
Existing concept erasure methods are input-dependent, only protecting against specific user prompts and leaving the model vulnerable to unexpected inputs.
The authors define an "unconditional safety" criterion that measures the model's robustness to adversarial prompts of increasing complexity, going beyond specific user inputs.
Experiments on a toy MNIST model show that TV-based concept erasure provides better unconditional safety than input-dependent fine-tuning.
For large Stable Diffusion models, the authors propose "Diverse Inversion" to estimate the required TV edit strength without relying on specific prompts.
Diverse Inversion allows them to apply the TV edit only to a subset of the model weights, enhancing the erasure capabilities while better maintaining the core functionality of the model.
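The summary does not spell out how a TV edit is computed. As a rough illustration, the sketch below follows the standard task-arithmetic formulation: the task vector is the difference between weights fine-tuned on the concept and the pretrained weights, and erasure subtracts a scaled copy of it. The function names, the toy weight dictionaries, and the `layers` argument for restricting the edit to a subset of weights are assumptions made for this example, not the authors' code. Choosing `alpha` is the step the summary attributes to Diverse Inversion, which is not reproduced here.

```python
from typing import Dict, Iterable, List, Optional

# Toy stand-in for a model state dict: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def task_vector(pretrained: Weights, finetuned: Weights) -> Weights:
    """Per-parameter difference: finetuned minus pretrained."""
    return {
        name: [f - p for p, f in zip(pretrained[name], finetuned[name])]
        for name in pretrained
    }


def erase_concept(
    pretrained: Weights,
    tv: Weights,
    alpha: float,
    layers: Optional[Iterable[str]] = None,
) -> Weights:
    """Subtract alpha * tv from the pretrained weights.

    If `layers` is given, only those parameters are edited; all others
    are copied unchanged (hypothetical way to edit a weight subset).
    """
    selected = set(tv) if layers is None else set(layers)
    edited: Weights = {}
    for name, values in pretrained.items():
        if name in selected:
            edited[name] = [p - alpha * d for p, d in zip(values, tv[name])]
        else:
            edited[name] = list(values)
    return edited


# Hypothetical two-parameter "model"; the values are illustrative only.
pretrained = {"attn": [1.0, 2.0], "mlp": [0.5, -0.5]}
finetuned_on_concept = {"attn": [1.5, 1.0], "mlp": [1.0, 0.0]}

tv = task_vector(pretrained, finetuned_on_concept)
edited = erase_concept(pretrained, tv, alpha=2.0, layers=["attn"])
```

Because the edit acts on the weights rather than on any prompt, it does not depend on which inputs a user later supplies, which is the input-independence property the summary emphasizes.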