Lutz, M., Choenni, R., Strohmaier, M., & Lauscher, A. (2024). Local Contrastive Editing of Gender Stereotypes. arXiv preprint arXiv:2410.17739v1.
This paper investigates how to localize the individual weights in language models (LMs) that contribute to stereotypical gender bias, and how to modify those weights to mitigate it.
The researchers developed a novel two-step approach called "local contrastive editing." First, they identified weights associated with gender stereotypes by comparing subnetworks extracted from LMs trained on datasets intentionally designed to be either stereotypical or anti-stereotypical. They employed unstructured magnitude pruning to discover these subnetworks. Second, they applied various editing strategies, including weight interpolation, extrapolation, and pruning, to adjust the identified weights in the target model relative to a reference model. The effectiveness of these strategies was evaluated using established bias benchmarks (WEAT, StereoSet, CrowS-Pairs) and performance metrics (perplexity, language modeling score).
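To make the two steps concrete, the following is a minimal sketch in PyTorch. It assumes we already have corresponding weight tensors from the stereotypically and anti-stereotypically trained models; the function names, the default 50% sparsity, and the interpolation coefficient are illustrative assumptions, not the authors' released implementation.

```python
import torch

def magnitude_prune_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Unstructured magnitude pruning: prune the smallest-|w| fraction
    (`sparsity`) of entries and return a boolean keep-mask."""
    k = max(1, int(weight.numel() * sparsity))   # number of weights to prune
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight.abs() > threshold              # True = kept, False = pruned

def localize_contrastive_weights(w_stereo: torch.Tensor,
                                 w_anti: torch.Tensor,
                                 sparsity: float = 0.5) -> torch.Tensor:
    """Step 1: compare the subnetworks of the stereotypical and
    anti-stereotypical models; weights where the two pruning masks
    disagree are candidates for encoding the stereotype."""
    mask_stereo = magnitude_prune_mask(w_stereo, sparsity)
    mask_anti = magnitude_prune_mask(w_anti, sparsity)
    return mask_stereo != mask_anti              # disagreement mask

def interpolation_edit(w_target: torch.Tensor,
                       w_reference: torch.Tensor,
                       mask: torch.Tensor,
                       alpha: float = 0.5) -> torch.Tensor:
    """Step 2 (interpolation variant): move only the localized weights
    of the target model toward the reference model. alpha = 0 leaves
    the target unchanged; alpha > 1 extrapolates past the reference."""
    edited = w_target.clone()
    edited[mask] = (1 - alpha) * w_target[mask] + alpha * w_reference[mask]
    return edited
```

In practice this would be applied per weight matrix across the model; a pruning-style edit would instead zero out the localized weights rather than interpolating them.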
The research provides evidence that localizing and editing specific weights in LMs can effectively control and mitigate encoded gender stereotypes. The proposed contrastive editing strategies offer a promising avenue for developing parameter-efficient bias mitigation techniques.
This work advances our understanding of how stereotypical biases manifest in the parameter space of LMs. It offers a bias mitigation approach that is more targeted, and potentially less disruptive to overall model performance, than traditional fine-tuning methods.
The study was limited to a single model architecture (BERT) and a binary specification of gender bias. Future research should explore the generalizability of these findings to other architectures, bias types, and more nuanced representations of gender. Additionally, investigating the long-term effects and potential unintended consequences of local contrastive editing is crucial.