The paper introduces a new learning paradigm called "non-fine-tunable learning" to address the issue of powerful pre-trained models being misused for unethical or illegal tasks. The key idea is to protect a pre-trained model from being fine-tuned for restricted domains while preserving its performance on the original task.
The authors develop SOPHON, a non-fine-tunable learning framework that consists of two key optimization modules:
Fine-Tuning Suppression in Restricted Domain: This module degrades the fine-tuning performance of the pre-trained model in the restricted domain. It simulates the fine-tuning process an adversary would run, uses the simulated result to approximate the model's post-fine-tuning performance, and optimizes the protected model so that this performance remains poor (see the sketch after the two modules).
Normal Training Reinforcement in Original Domain: This module is designed to maintain the performance of the protected model in the original domain.
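A minimal PyTorch sketch of how such an alternating update could look is given below. It illustrates the general idea rather than the paper's exact algorithm: the names (`sophon_style_step`, `suppression_loss_fn`, `inner_lr`, `inner_steps`, `alpha`) and the first-order shortcut of reusing the simulated model's gradients for the protected model are assumptions made for the sake of a compact example.

```python
import copy

import torch
import torch.nn.functional as F


def sophon_style_step(model, opt, restricted_batch, original_batch,
                      suppression_loss_fn, inner_lr=1e-3, inner_steps=2, alpha=1.0):
    """One hypothetical alternating update: (1) suppress fine-tunability in the
    restricted domain via a simulated fine-tuning run, (2) reinforce normal
    training in the original domain."""
    x_r, y_r = restricted_batch
    x_o, y_o = original_batch

    # (1) Fine-tuning suppression: simulate an adversary fine-tuning a copy of
    # the protected model on restricted-domain data for a few steps.
    surrogate = copy.deepcopy(model)
    inner_opt = torch.optim.SGD(surrogate.parameters(), lr=inner_lr)
    for _ in range(inner_steps):
        inner_opt.zero_grad()
        F.cross_entropy(surrogate(x_r), y_r).backward()
        inner_opt.step()

    # Evaluate a suppression loss on the simulated fine-tuned copy and reuse
    # its gradients for the protected model (a first-order shortcut; the
    # paper's gradient approximation may differ).
    surrogate.zero_grad()
    suppression_loss_fn(surrogate(x_r), y_r).backward()
    opt.zero_grad()
    for p, p_s in zip(model.parameters(), surrogate.parameters()):
        if p_s.grad is not None:
            p.grad = alpha * p_s.grad.clone()
    opt.step()

    # (2) Normal training reinforcement: a standard supervised step on
    # original-domain data to preserve the model's intended performance.
    opt.zero_grad()
    F.cross_entropy(model(x_o), y_o).backward()
    opt.step()
```

Here `suppression_loss_fn` would be one of the alternative losses described next.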
The authors propose alternative suppression losses, namely inverse cross-entropy and the KL divergence from the uniform distribution, to facilitate convergence of the fine-tuning suppression process (both are sketched below).
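Under the same caveat, here is one plausible PyTorch formulation of these two losses; the exact definitions in the paper may differ, and the function names and hyperparameters are assumptions.

```python
import torch
import torch.nn.functional as F


def inverse_cross_entropy(logits, targets, eps=1e-6):
    """One plausible 'inverse cross-entropy': penalize confidence on the true
    class via -log(1 - p_true), which is bounded below by 0 (reached as
    p_true -> 0), unlike a naively negated cross-entropy that decreases
    without bound."""
    p_true = F.softmax(logits, dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)
    return -torch.log(1.0 - p_true + eps).mean()


def kl_from_uniform(logits, targets=None):
    """KL divergence between the uniform distribution and the predicted
    distribution; minimizing it pushes predictions toward uninformative
    outputs. `targets` is unused and kept only to match the interface of the
    sketch above."""
    log_probs = F.log_softmax(logits, dim=-1)
    uniform = torch.full_like(log_probs, 1.0 / logits.size(-1))
    # F.kl_div(input=log_probs, target=probs) computes KL(target || exp(input)).
    return F.kl_div(log_probs, uniform, reduction="batchmean")
```

Either function could serve as the `suppression_loss_fn` in the alternating update sketched earlier.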
Extensive experiments are conducted on classification and generation tasks, covering seven restricted domains and six model architectures. The results show that fine-tuning the SOPHON-protected models incurs an overhead comparable to or even greater than training from scratch, effectively preventing the misuse of pre-trained models.