Core Concepts
Distributional Dispreference Optimization (D2O) achieves alignment using solely human-annotated negative samples, reducing harmfulness while maintaining helpfulness.
Abstract
Large language models (LLMs) are revolutionizing AI but risk propagating unethical content.
Existing alignment methods rely on human preference data and face challenges from noisy labels.
D2O proposes a new approach using only human-annotated negative samples for alignment.
The method maximizes the discrepancy between the model's generated responses and the annotated negative ones, steering generation away from harmful information (see the sketch below this list).
Theoretical analysis shows D2O learns a distributional preference model reflecting human dispreference.
Extensive experiments demonstrate D2O's effectiveness in reducing harmfulness and maintaining helpfulness.
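As a rough illustration of the method bullet above, the snippet below sketches a DPO-style contrastive loss that uses only dispreferred (negative) responses: the policy's own sampled generations stand in for the missing positive side, and the objective widens the implicit-reward gap between them and the annotated negatives. The function name, the `beta` temperature, and the log-probability inputs are illustrative assumptions for this sketch, not the paper's exact D2O objective.

```python
import torch
import torch.nn.functional as F

def dispreference_loss(logp_policy_gen, logp_ref_gen,
                       logp_policy_neg, logp_ref_neg, beta=0.1):
    """Contrastive loss pushing the policy's own generations away from
    human-annotated negative (dispreferred) responses.

    Each argument is a tensor of per-response log-probabilities:
      logp_policy_gen / logp_ref_gen -- policy / reference log-prob of
                                        responses sampled from the policy
      logp_policy_neg / logp_ref_neg -- policy / reference log-prob of
                                        the annotated negative responses
    """
    # Implicit reward margins relative to the reference model,
    # as in DPO-style objectives.
    margin_gen = beta * (logp_policy_gen - logp_ref_gen)
    margin_neg = beta * (logp_policy_neg - logp_ref_neg)
    # Maximize the gap between self-generated responses and negatives.
    return -F.logsigmoid(margin_gen - margin_neg).mean()
```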
Stats
Recent LLMs demonstrate remarkable capabilities that power a wide range of real-world applications.
D2O achieves alignment using only human-annotated negative samples.
D2O avoids harmful information by maximizing the discrepancy between generated responses and negative responses.
Quotes
"This work pivots towards a new research focus: achieving alignment using solely human-annotated negative samples."
"D2O integrates an implicit Jeffrey Divergence regularization to balance the exploitation and exploration of reference policies."