Bibliographic Information: Ovalle, A., Lehman Pavasovic, K., Martin, L., Zettlemoyer, L., Smith, E. M., Williams, A., & Sagun, L. (2024). The Root Shapes the Fruit: On the Persistence of Gender-Exclusive Harms in Aligned Language Models. Queer in AI Workshop at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024). arXiv:2411.03700v1 [cs.CL].
Research Objective: This paper investigates how alignment procedures in large language models (LLMs) interact with pre-existing biases against transgender, non-binary, and gender-diverse (TGNB) individuals, examining whether these biases persist or are amplified despite efforts to promote helpful and harmless behavior.
Methodology: The researchers systematically evaluate 12 publicly available LLMs across three stages of preference fine-tuning: pretraining (base model), supervised fine-tuning (SFT), and direct preference optimization (DPO). They employ the TANGO and WinoQueer benchmarks, designed to assess gender-non-affirmative language (e.g., misgendering) and anti-TGNB stigma, respectively. Additionally, they propose a novel framework for analyzing the implicit reward signals of DPO-aligned models, uncovering potential mechanisms of bias propagation (see the sketch below).
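For context on the implicit-reward analysis: DPO never trains an explicit reward model, but its objective implies the reward r(x, y) = β[log πθ(y|x) − log πref(y|x)], which can be recovered from the policy and reference log-probabilities alone. Below is a minimal sketch of how such a reward could be computed for a single prompt–completion pair; it illustrates the standard DPO identity, not the authors' implementation, and the checkpoint names are placeholders.

```python
# Minimal sketch of the DPO implicit reward:
#   r(x, y) = beta * (log pi_theta(y|x) - log pi_ref(y|x))
# Illustrative only; checkpoint names below are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def completion_logprob(model, tokenizer, prompt: str, completion: str) -> float:
    """Sum of log-probabilities the model assigns to the completion tokens,
    assuming the prompt's tokenization is a prefix of the full tokenization."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Position t's logits predict token t+1, so shift targets by one.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[0, prompt_len - 1:].sum().item()

def implicit_reward(policy, reference, tokenizer, prompt, completion, beta=0.1):
    """DPO's implicit reward for a single prompt-completion pair."""
    lp_policy = completion_logprob(policy, tokenizer, prompt, completion)
    lp_ref = completion_logprob(reference, tokenizer, prompt, completion)
    return beta * (lp_policy - lp_ref)

# Placeholder checkpoints: a DPO-tuned policy and the SFT model it started from.
tok = AutoTokenizer.from_pretrained("org/model-dpo")            # hypothetical
policy = AutoModelForCausalLM.from_pretrained("org/model-dpo")  # hypothetical
ref = AutoModelForCausalLM.from_pretrained("org/model-sft")     # hypothetical

reward = implicit_reward(policy, ref, tok, "Alex shared that ",
                         "their pronouns are they/them.")
```

Comparing such rewards across paired completions (e.g., gender-affirming vs. non-affirming continuations of the same prompt) is one way a framework like the authors' could surface which behaviors the preference tuning has implicitly rewarded.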
Key Findings: The study reveals that DPO-aligned LLMs can amplify existing TGNB biases, even when initialized from relatively unbiased base models.
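One way to make "amplification" concrete (an illustrative metric over assumed WinoQueer-style sentence pairs, not necessarily the paper's exact scoring) is to compare how often each checkpoint assigns higher likelihood to a stigmatizing sentence than to its counter-stereotypical counterpart, before and after preference tuning:

```python
# Illustrative amplification check: the fraction of stereotype/counter-stereotype
# sentence pairs for which a model assigns higher log-likelihood to the
# stigmatizing sentence. A rising rate from base -> SFT -> DPO would indicate
# amplification. A sketch of one plausible metric, not the paper's code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sentence_logprob(model, tokenizer, sentence: str) -> float:
    """Total log-likelihood of a sentence under a causal LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    return logprobs.gather(2, ids[:, 1:].unsqueeze(-1)).sum().item()

def stigma_preference_rate(model, tokenizer, pairs) -> float:
    """pairs: list of (stigmatizing_sentence, counter_sentence) tuples."""
    wins = sum(
        sentence_logprob(model, tokenizer, stigma)
        > sentence_logprob(model, tokenizer, counter)
        for stigma, counter in pairs
    )
    return wins / len(pairs)

# Hypothetical usage: compare a base checkpoint against its DPO-tuned version.
# tok = AutoTokenizer.from_pretrained("org/model-base")
# base = AutoModelForCausalLM.from_pretrained("org/model-base")
# dpo = AutoModelForCausalLM.from_pretrained("org/model-dpo")
# amplified = (stigma_preference_rate(dpo, tok, pairs)
#              > stigma_preference_rate(base, tok, pairs))
```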
Main Conclusions: The authors argue that current bias evaluation practices in LLM development, which predominantly focus on binary gender, are insufficient for addressing harms against TGNB individuals, and they emphasize the need for bias evaluations and alignment practices that account for gender identities beyond the binary.
Significance: This research highlights the urgent need to address the complex interplay between alignment procedures and social biases in LLMs. It provides valuable insights for developing more equitable and inclusive language technologies that do not perpetuate harm against marginalized communities.
Limitations and Future Research: The study acknowledges limitations regarding the specific models and datasets used and suggests further exploration of bias amplification in other alignment regimes like RLHF. Future research could investigate the impact of different preference datasets, model architectures, and mitigation strategies on TGNB bias. Additionally, exploring the social impacts of LLM hallucination and model refusal in the context of TGNB representation is crucial.