Although widely used to mitigate harms in large language models, the Helpful and Harmless (HH) dataset has significant shortcomings: models trained on it can exhibit exaggerated safety behaviors and perpetuate harmful associations with demographic groups.
This research paper enhances the safety of large language models (LLMs) by introducing a benchmark for evaluating, and techniques for improving, their ability to "course-correct", i.e., to self-correct midway through generating harmful content.
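As a rough illustration of how such course-correction ability might be scored, the sketch below measures how often a model, when forced to continue from a partially harmful prefix, produces a continuation that is no longer harmful. The `generate` and `is_harmful` callables are hypothetical placeholders, not names from the paper.

```python
# Minimal sketch of a course-correction metric: feed the model a harmful prompt
# plus a partially harmful continuation, and count how often its own continuation
# steers away from harm. `generate` and `is_harmful` are stand-ins for a real
# LLM call and a real harm judge.
from typing import Callable, Iterable

def course_correction_rate(
    cases: Iterable[tuple[str, str]],        # (harmful_prompt, harmful_prefix) pairs
    generate: Callable[[str], str],          # continues the given text
    is_harmful: Callable[[str], bool],       # judges the continuation
) -> float:
    corrected, total = 0, 0
    for prompt, harmful_prefix in cases:
        continuation = generate(prompt + harmful_prefix)
        corrected += int(not is_harmful(continuation))
        total += 1
    return corrected / max(total, 1)
```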
This research paper introduces ChineseSafe, a comprehensive benchmark dataset designed to evaluate the safety of large language models (LLMs) specifically in handling Chinese content, addressing the limitations of existing benchmarks by including categories like political sensitivity, pornography, and variant/homophonic words.
Deactivating specific neurons in large language models (LLMs) can mitigate the trade-off between fairness and privacy awareness, leading to simultaneous improvements in both areas without compromising general capabilities.
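A minimal sketch of neuron deactivation, assuming a LLaMA-style model from Hugging Face transformers: the MLP's intermediate ("neuron") activations are the input of `down_proj`, so a forward pre-hook can zero a chosen subset. The layer and neuron indices below are placeholders, not the ones identified in the paper.

```python
# Sketch: deactivate specific MLP neurons by zeroing their intermediate
# activations with a forward pre-hook (LLaMA-style layout assumed).
import torch

def deactivate_neurons(mlp_module: torch.nn.Module, neuron_ids: list[int]):
    def pre_hook(module, args):
        (hidden,) = args                    # [batch, seq, intermediate_size]
        hidden = hidden.clone()
        hidden[..., neuron_ids] = 0.0       # silence the chosen neurons
        return (hidden,)

    # In the Hugging Face LLaMA layout, down_proj's input holds the neuron activations.
    return mlp_module.down_proj.register_forward_pre_hook(pre_hook)

# Usage (hypothetical layer and neuron indices):
# handle = deactivate_neurons(model.model.layers[10].mlp, [42, 917, 2048])
# ... generate and evaluate fairness / privacy awareness ...
# handle.remove()   # restore original behavior
```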
This paper proposes a framework that combines prompt engineering and legal knowledge graphs to improve the safety of large language models (LLMs) by identifying and explaining potential legal implications of LLM-generated recommendations.
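The sketch below shows one way such a framework could be wired together: retrieve potentially relevant provisions from a (here, toy) legal knowledge graph and inject them into a review prompt. The triples, keyword matching, and prompt wording are invented for illustration and are not taken from the paper.

```python
# Illustrative sketch only: combine a toy legal "knowledge graph" (a list of
# triples) with prompt engineering so an LLM can flag possible legal
# implications of a recommendation it generated.
LEGAL_KG = [
    ("personal data", "regulated_by", "GDPR Art. 6 (lawful basis for processing)"),
    ("automated decision", "regulated_by", "GDPR Art. 22 (automated decision-making)"),
    ("medical advice", "regulated_by", "national medical practice acts"),
]

def retrieve_provisions(recommendation: str) -> list[str]:
    """Naive keyword match over KG subjects; a real system would use entity linking."""
    rec = recommendation.lower()
    return [f"{s} {p.replace('_', ' ')} {o}" for s, p, o in LEGAL_KG if s in rec]

def build_safety_prompt(recommendation: str) -> str:
    provisions = retrieve_provisions(recommendation) or ["(no matching provisions found)"]
    context = "\n".join(f"- {p}" for p in provisions)
    return (
        "You are reviewing an AI-generated recommendation for legal risk.\n"
        f"Recommendation: {recommendation}\n"
        f"Potentially relevant provisions:\n{context}\n"
        "Explain any legal implications and suggest a safer alternative if needed."
    )

print(build_safety_prompt("Collect personal data from users to personalize ads."))
```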
This research paper presents evidence that specific attention heads within large language models (LLMs) play a crucial role in safety mechanisms, and that ablating these heads can significantly compromise the model's ability to reject harmful queries while having minimal impact on helpfulness.
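A hedged sketch of attention-head ablation in a LLaMA-style Hugging Face model: zero a head's slice of the concatenated head outputs before the attention output projection `o_proj`. The layer and head indices are placeholders, not the safety heads identified in the paper.

```python
# Sketch of attention-head ablation: mask chosen heads' slices of the
# concatenated head outputs before the attention output projection.
import torch

def ablate_heads(attn_module: torch.nn.Module, head_ids: list[int], head_dim: int):
    def pre_hook(module, args):
        (hidden,) = args                    # [batch, seq, num_heads * head_dim]
        hidden = hidden.clone()
        for h in head_ids:
            hidden[..., h * head_dim:(h + 1) * head_dim] = 0.0
        return (hidden,)

    return attn_module.o_proj.register_forward_pre_hook(pre_hook)

# Usage (hypothetical layer and head indices):
# cfg = model.config
# handle = ablate_heads(model.model.layers[12].self_attn, head_ids=[3, 7],
#                       head_dim=cfg.hidden_size // cfg.num_attention_heads)
# ... compare refusal rate on harmful prompts vs. helpfulness on benign ones ...
# handle.remove()
```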
Balancing safety and usefulness in large language models is critical; overgenerating training data with an advanced teacher model and then applying preference-optimization techniques can reduce over-refusal while maintaining safety.
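A minimal sketch of the preference-optimization step, using a DPO-style loss over per-sequence log-probabilities from the policy and a frozen reference model; this illustrates the general technique rather than the paper's exact training recipe.

```python
# DPO-style loss: prefer safe-but-helpful ("chosen") responses over over-refusing
# or unsafe ("rejected") ones, relative to a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Example with dummy per-sequence log-probabilities for two preference pairs:
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-15.0, -11.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-14.0, -10.5]))
print(loss.item())
```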
The PKU-SafeRLHF dataset supports aligning large language models with human preferences for both safety and helpfulness by providing a large-scale resource of preference annotations for training and evaluating safety-alignment techniques.
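A small sketch of inspecting the dataset with the Hugging Face `datasets` library; the repository id `PKU-Alignment/PKU-SafeRLHF` and its splits and fields should be checked against the current dataset card, as they may differ across versions.

```python
# Load and peek at the preference data (repository id and split assumed).
from datasets import load_dataset

ds = load_dataset("PKU-Alignment/PKU-SafeRLHF", split="train")
print(ds.column_names)   # prompt, paired responses, safety/preference labels, etc.
print(ds[0])             # one annotated preference example
```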
Fine-tuning large language models, even on seemingly benign datasets, can introduce significant safety risks, and SafetyLock offers a novel, efficient, and transferable solution to mitigate these risks by identifying and steering specific safety-sensitive attention heads within the model's architecture.
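In the spirit of head-level interventions like SafetyLock, but not its released implementation, the sketch below adds a fixed "safety direction" to one attention head's output before the output projection in a LLaMA-style model; the layer, head index, direction vector, and scale are placeholders.

```python
# Sketch of head-level steering: add a unit-norm direction to a chosen head's
# slice of the concatenated head outputs before o_proj.
import torch

def steer_head(attn_module: torch.nn.Module, head_id: int, head_dim: int,
               direction: torch.Tensor, alpha: float = 4.0):
    direction = direction / direction.norm()      # unit-norm "safety" direction

    def pre_hook(module, args):
        (hidden,) = args                          # [batch, seq, num_heads * head_dim]
        hidden = hidden.clone()
        sl = slice(head_id * head_dim, (head_id + 1) * head_dim)
        hidden[..., sl] = hidden[..., sl] + alpha * direction.to(hidden)
        return (hidden,)

    return attn_module.o_proj.register_forward_pre_hook(pre_hook)

# Usage (assuming a fine-tuned LLaMA-style model whose safety behavior drifted):
# head_dim = model.config.hidden_size // model.config.num_attention_heads
# handle = steer_head(model.model.layers[14].self_attn, head_id=5,
#                     head_dim=head_dim, direction=torch.randn(head_dim))
# ... generate as usual; call handle.remove() to undo the intervention ...
```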