Negating Negatives: Achieving Alignment with Human-Annotated Negative Samples for Large Language Models
Proposes Distributional Dispreference Optimization (D2O), which achieves alignment using only human-annotated negative samples, reducing harmfulness while maintaining helpfulness.