Seyitoğlu, A., Kuvshinov, A., Schwinn, L., & Günnemann, S. (2024). Extracting Unlearned Information from LLMs with Activation Steering. arXiv preprint arXiv:2411.02631v1.
This research paper investigates whether activation steering can extract supposedly unlearned information from large language models (LLMs). The authors aim to determine whether this technique can reveal vulnerabilities in current unlearning methods and to assess how much sensitive or private information remains retrievable after unlearning.
The researchers introduce a novel approach called Anonymized Activation (AnonAct) Steering. This method involves generating anonymized versions of questions related to the unlearned topic and calculating steering vectors based on the differences in internal model representations between the original and anonymized questions. These vectors are then used to guide the model's output during generation, potentially increasing the frequency of correct answers that reveal unlearned information. The authors evaluate AnonAct Steering on three different unlearning methods (WhoIsHarryPotter, TOFU, and ROME) and corresponding datasets, analyzing the frequency of correct answers generated with and without their method.
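The core of the approach can be illustrated with a minimal NumPy sketch: a steering vector is computed as the mean difference between internal activations for the original questions and their anonymized counterparts, and is then added to hidden states during generation. This is an illustrative reconstruction, not the paper's code; the function names, the fixed scaling factor `alpha`, and the assumption that layer activations have already been extracted are all simplifications.

```python
import numpy as np

def steering_vector(orig_acts, anon_acts):
    """Mean difference between activations of original questions and their
    anonymized versions, taken at one chosen layer (shape: [n_pairs, dim])."""
    return np.mean(np.asarray(orig_acts) - np.asarray(anon_acts), axis=0)

def apply_steering(hidden_state, vec, alpha=1.0):
    """Add the scaled steering vector to a hidden state during generation,
    nudging the model back toward its pre-unlearning representations."""
    return hidden_state + alpha * vec

# Toy example with 2-dimensional activations from two question pairs.
orig = [[1.0, 2.0], [3.0, 4.0]]   # activations for original questions
anon = [[0.0, 1.0], [1.0, 2.0]]   # activations for anonymized questions
vec = steering_vector(orig, anon)
steered = apply_steering(np.zeros(2), vec, alpha=1.0)
```

In practice the activations would come from a transformer layer's residual stream (e.g. via forward hooks), and `alpha` would be tuned per model and layer.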
The research demonstrates that activation steering can be a powerful tool for evaluating the robustness of LLM unlearning techniques. While the method shows promise in revealing vulnerabilities, its effectiveness varies depending on the scope of the unlearned subject matter. Unlearning broader topics with numerous interlinked concepts appears more challenging and susceptible to information leakage through activation steering.
This study contributes valuable insights into the ongoing challenge of developing truly secure and private LLMs. It emphasizes the need for more robust unlearning methods that effectively remove sensitive information and prevent its retrieval through advanced techniques like activation steering.
The authors acknowledge that AnonAct Steering's effectiveness is influenced by the breadth of the unlearned topic. Future research could explore alternative activation steering approaches or combine them with other methods to improve the retrieval of specific or granular unlearned information. Additionally, investigating the generalizability of these findings across different LLM architectures and unlearning techniques is crucial.