Unlearning aims to efficiently eliminate the influence of specific undesirable data, and the model capabilities associated with it, from pre-trained large language models, while preserving their essential knowledge and their generation and generalization abilities.
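Most of the methods summarized below can be read as instances of a generic trade-off objective. The following formulation is a common framing in the unlearning literature, not the exact objective of any single paper here:

```latex
\min_{\theta} \; \mathcal{L}_{\mathrm{forget}}(\theta; D_f) \;+\; \lambda \, \mathcal{L}_{\mathrm{retain}}(\theta; D_r)
```

Here $D_f$ is the forget set, $D_r$ a retain set, $\mathcal{L}_{\mathrm{forget}}$ penalizes reproduction of the targeted data (e.g., negated log-likelihood, i.e., gradient ascent), $\mathcal{L}_{\mathrm{retain}}$ preserves utility (e.g., the standard language-modeling loss or a KL term to the original model), and $\lambda$ balances the two. Several entries below (e.g., FLAT) are motivated precisely by removing the dependence on $D_r$.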
Deterministic evaluations of unlearning in Large Language Models (LLMs), which typically inspect only a single greedy-decoded output, fail to capture information leakage that persists in the full output distribution; to address this, a probabilistic evaluation framework with novel metrics is proposed, together with an entropy-optimization approach for more effective unlearning.
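As a concrete illustration of the distinction, the sketch below estimates leakage by sampling many completions rather than checking one greedy decode; the model name, prompt, `secret` string, and the simple hit-rate metric are placeholders, not the metrics proposed in the paper:

```python
# Sampling-based leakage estimate: a model may look clean under a single
# greedy decode yet still emit the forgotten string with non-trivial
# probability under sampling.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute the unlearned model under evaluation
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

@torch.no_grad()
def leakage_rate(prompt: str, secret: str, k: int = 100) -> float:
    """Empirical estimate of P(secret appears in a sampled completion)."""
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(
        **inputs,
        do_sample=True,            # sample from the full output distribution
        top_p=0.95,
        max_new_tokens=50,
        num_return_sequences=k,
        pad_token_id=tok.eos_token_id,
    )
    texts = tok.batch_decode(out, skip_special_tokens=True)
    return sum(secret in t for t in texts) / k

print(leakage_rate("The password of user alice is", "hunter2"))
```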
Existing unlearning methods for Large Language Models (LLMs) often fail to differentiate between knowledge that should be forgotten (e.g., copyrighted content, private information) and knowledge that should be retained (e.g., public domain information), leading to over-forgetting and hindering the development of practical unlearning techniques.
Fine-tuning-based unlearning methods, while seemingly effective in behavioral tests, fail to genuinely erase targeted knowledge from large language models and instead primarily alter the knowledge retrieval process, potentially impacting the model's performance on unrelated tasks.
This paper proposes approaches for unlearning sensitive or copyrighted content from large language models (LLMs) that preserve overall performance while mitigating the twin risks of hallucination and excessive ignorance (over-forgetting).
This research introduces two techniques, Inverted Hinge Loss (IHL) and Fisher-weighted Initialization of Low-Rank Adapters (FILA), for parameter-efficient unlearning of sensitive information from Large Language Models (LLMs) while preserving overall performance and mitigating catastrophic forgetting.
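A minimal sketch of an inverted-hinge-style forget loss follows, assuming the commonly quoted form L = 1 + p(y_t) - max over v != y_t of p(v): unlike unbounded gradient ascent, the gradient vanishes once the true token drops below its strongest competitor. Shapes and names are illustrative; FILA's Fisher-weighted adapter initialization is a separate component and is omitted here.

```python
import torch
import torch.nn.functional as F

def inverted_hinge_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Assumed form: mean over positions of 1 + p(true) - max_{v != true} p(v).

    logits: (batch, seq, vocab); targets: (batch, seq) token ids of the forget text.
    """
    probs = F.softmax(logits, dim=-1)
    idx = targets.unsqueeze(-1)
    p_true = probs.gather(-1, idx).squeeze(-1)                    # (batch, seq)
    # Zero out the true token, then take the strongest competitor's probability.
    p_competitor = probs.scatter(-1, idx, 0.0).max(dim=-1).values
    return (1.0 + p_true - p_competitor).mean()

# Toy usage on random placeholder data; in practice, minimize on forget-set batches.
logits = torch.randn(2, 8, 32000, requires_grad=True)
targets = torch.randint(0, 32000, (2, 8))
inverted_hinge_loss(logits, targets).backward()
```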
This paper introduces FLAT, a novel LLM unlearning method that effectively removes the influence of specific data from trained models while preserving overall performance and general knowledge, all without relying on retain data or a reference LLM.
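The retain-free idea can be sketched as a two-term loss computed from forget data alone: pull the model toward an exemplary template answer while pushing it away from the original forget answer. FLAT itself derives per-term weights from f-divergence theory; the sketch below omits that weighting, and all names, including the model and template string, are placeholders:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

def answer_nll(prompt: str, answer: str) -> torch.Tensor:
    """Negative log-likelihood of `answer` given `prompt`, scored on answer tokens only."""
    ids = tok(prompt + answer, return_tensors="pt").input_ids
    n_prompt = tok(prompt, return_tensors="pt").input_ids.size(1)
    logits = model(ids).logits[:, :-1]
    labels = ids[:, 1:].clone()
    labels[:, : n_prompt - 1] = -100                 # ignore the prompt positions
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), labels.reshape(-1), ignore_index=-100
    )

def retain_free_loss(prompt: str, forget_answer: str,
                     template_answer: str = " I cannot share that.") -> torch.Tensor:
    # Raise the likelihood of the template answer, lower that of the forget answer.
    return answer_nll(prompt, template_answer) - answer_nll(prompt, forget_answer)

retain_free_loss("Q: What is Alice's address? A:", " 12 Hypothetical Lane").backward()
```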
UNDIAL is an unlearning method for large language models that leverages self-distillation with adjusted logits to robustly mitigate the retention of sensitive information while preserving the model's language capabilities.
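A minimal sketch of self-distillation on adjusted logits, assuming the mechanism is to lower the logits of the tokens being unlearned by an offset and distill the model toward the resulting softened distribution; the offset `gamma` and all shapes are illustrative, not taken verbatim from the paper:

```python
import torch
import torch.nn.functional as F

def adjusted_logit_distillation_loss(logits: torch.Tensor, targets: torch.Tensor,
                                     gamma: float = 5.0) -> torch.Tensor:
    """logits: (batch, seq, vocab) from the current model; targets: (batch, seq) forget ids."""
    # Teacher derived from the model itself (self-distillation): copy the logits
    # and turn down the logit of each token to be forgotten by `gamma`.
    adjusted = logits.detach().clone()
    idx = targets.unsqueeze(-1)
    adjusted.scatter_(-1, idx, adjusted.gather(-1, idx) - gamma)
    teacher = F.softmax(adjusted, dim=-1)
    student = F.log_softmax(logits, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean")

# Toy usage on random placeholder data.
logits = torch.randn(2, 8, 32000, requires_grad=True)
targets = torch.randint(0, 32000, (2, 8))
adjusted_logit_distillation_loss(logits, targets).backward()
```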