
Evaluating Gender Bias in Multilingual Masked Language Models


Core Concepts
Multilingual masked language models exhibit varying degrees of gender bias, which can be more reliably assessed using a novel model-based sentence generation method and strict bias metrics.
Abstract
The paper presents a comprehensive approach to evaluating gender bias in multilingual masked language models (MLMs). It identifies limitations in previous work and proposes two novel methods to generate sentence pairs for a more robust analysis:

- Lexicon-based Sentence Generation (LSG): extracts sentences containing a single gender word from a corpus and generates counterpart sentences by replacing the gender word.
- Model-based Sentence Generation (MSG): masks the gender word in the extracted sentences and uses the MLM to predict the most likely male and female words based on the context.

The paper also introduces three evaluation metrics:

- Multilingual Bias Evaluation (MBE): compares the All Unmasked Likelihood with Attention (AULA) scores between male and female sentences.
- Strict Bias Metric (SBM): compares the likelihoods of only the gender words between parallel sentences.
- Direct Comparison Bias Metric (DBM): directly compares the MLM's prediction scores for the male and female words.

The authors create multilingual gender lexicons for five languages (Chinese, English, German, Portuguese, Spanish) and evaluate the gender bias of MLMs trained on these languages using the proposed methods. The results show that the previous approach is data-sensitive and not stable, as the bias direction often flips when different scoring metrics are used. In contrast, the model-based method (MSG) provides more consistent and reliable evaluations of gender bias in multilingual MLMs.
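To make the MSG and DBM ideas concrete, here is a minimal sketch using the Hugging Face `transformers` fill-mask pipeline: the gender word in a sentence is masked, and the MLM's scores for a male/female word pair are compared directly. This is not the authors' released implementation; the model name, the example sentence, and the word pair are illustrative assumptions.

```python
from transformers import pipeline

# Illustrative sketch of model-based sentence generation (MSG) plus a direct
# score comparison in the spirit of DBM; not the paper's released code.
fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

# Example sentence with the gender word masked (hypothetical example).
sentence = "The [MASK] worked late at the hospital."

# Hypothetical gender word pair; the paper builds full gender lexicons per language.
male_word, female_word = "man", "woman"

# Ask the MLM to score only the two candidate gender words in this context.
results = fill_mask(sentence, targets=[male_word, female_word])
scores = {r["token_str"]: r["score"] for r in results}

# DBM-style comparison: which gender word does the model prefer in this context?
preferred = max(scores, key=scores.get)
print(scores)
print(f"Model prefers '{preferred}' in this context.")
```

In the paper, such per-sentence comparisons are aggregated over large sets of extracted sentences in each language; MBE and SBM differ mainly in which likelihoods are compared (AULA scores of whole sentences versus the likelihoods of the gender words alone).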
Stats
- The TED corpus contains about 427,436 English sentences, many of which have been translated into over 100 languages.
- The authors randomly sampled 11,000 English sentences from the TED corpus and found that the coverage rates of their multilingual gender lexicon range from 19.0% for Arabic to 91.7% for German.
- The total number of extracted sentences for bias evaluation ranges from 30,547 for Chinese to 114,168 for Spanish.
Quotes
"Bias is a disproportionate prejudice in favor of one side against another. Due to the success of transformer-based Masked Language Models (MLMs) and their impact on many NLP tasks, a systematic evaluation of bias in these models is needed more than ever." "Our results show that the previous approach is data-sensitive and not stable as it does not remove contextual dependencies irrelevant to gender. In fact, the results often flip when different scoring metrics are used on the same dataset, suggesting that gender bias should be studied on a large dataset using multiple evaluation metrics for best practice."

Deeper Inquiries

How do the gender bias evaluation results differ when using cross-lingual language models compared to language-specific models?

Cross-lingual and language-specific models tend to give different pictures of gender bias. Cross-lingual language models transfer knowledge and representations across multiple languages, allowing a more comprehensive assessment of gender bias in a multilingual context; this broader perspective can reveal patterns that are not apparent when focusing on individual languages, and can help distinguish biases that are consistent across languages from those tied to a specific linguistic context. Language-specific models, on the other hand, can provide more nuanced insights into gender bias within a particular language, capturing intricacies that a broader cross-lingual analysis might overlook. The choice between the two depends on the research goals and the scope of the bias evaluation study.
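A simple way to probe this difference empirically is to score the same masked template with a cross-lingual and a language-specific MLM and compare which gender word each prefers. The sketch below is only illustrative: the models, the German template, and the target words are assumptions, not the paper's experimental setup.

```python
from transformers import pipeline

# Compare a cross-lingual MLM with a language-specific one on the same template.
# Model names, template, and target words are illustrative assumptions.
models = {
    "cross-lingual": "bert-base-multilingual-cased",
    "German-specific": "bert-base-german-cased",
}
template = "[MASK] arbeitet im Krankenhaus."  # "... works in the hospital."
targets = ["Er", "Sie"]                       # "He" / "She"

for label, name in models.items():
    fill_mask = pipeline("fill-mask", model=name)
    scores = {r["token_str"]: r["score"] for r in fill_mask(template, targets=targets)}
    print(label, scores)
```

Differences in which word each model prefers, and by how much, give a first indication of whether cross-lingual training shifts the bias relative to the monolingual model.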

What other linguistic features beyond gender words could be leveraged to more comprehensively assess gender bias in language models?

Beyond gender words themselves, several other linguistic features could be leveraged for a more comprehensive assessment of gender bias. Grammatical structures that encode gender, such as gendered articles, adjectives, pronouns, and verb conjugations, interact with gendered words and contribute to the bias a model exhibits; in languages like Spanish or German, for example, articles and adjectives agree with a noun's grammatical gender, so bias can surface even where no explicit gender word appears. Examining how gendered terms are distributed across syntactic and semantic roles within sentences can likewise show how bias is propagated and reinforced in different linguistic contexts. Considering this wider range of features, rather than gender words alone, yields a more holistic picture of gender bias in language models.

How might the length of sentences in language models impact the gender bias scores, and what further investigation is needed to understand this relationship?

Sentence length can affect gender bias scores in several ways. Longer sentences carry more contextual information and linguistic cues that can influence the model's prediction of gendered terms, and the position of a gender word within a sentence (beginning, middle, or end) may also change how the model resolves it. Longer sentences also introduce more complexity and ambiguity, which makes bias harder to assess accurately. Understanding this relationship requires further investigation: analyzing the distribution of gendered terms across sentences of varying lengths, examining how sentence structure influences bias predictions, and exploring the role of surrounding context in shaping bias outcomes. Length-stratified analyses of this kind would clarify how sentence length interacts with bias scores and support more robust evaluation frameworks.
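One straightforward first step is to bucket evaluation sentences by length and re-aggregate the bias scores per bucket. The sketch below assumes per-sentence bias scores have already been computed upstream (e.g., a signed difference between male- and female-word likelihoods, with positive values meaning the male word was preferred); the data layout, bucket edges, and toy values are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean

def bias_by_length(examples, edges=(10, 20, 40)):
    """Group per-sentence bias scores by token-length bucket and average them.

    `examples` is assumed to be an iterable of (num_tokens, bias_score) pairs,
    where bias_score > 0 means the male word was preferred (an assumption about
    how the per-sentence score was defined upstream).
    """
    buckets = defaultdict(list)
    for num_tokens, score in examples:
        # Assign each sentence to the first bucket whose upper edge it fits under.
        label = next((f"<= {e} tokens" for e in edges if num_tokens <= e),
                     f"> {edges[-1]} tokens")
        buckets[label].append(score)
    # Report bucket size and mean bias per bucket.
    return {label: (len(scores), mean(scores)) for label, scores in buckets.items()}

# Toy usage with made-up per-sentence scores.
toy = [(8, 0.12), (15, -0.03), (35, 0.20), (60, 0.05)]
print(bias_by_length(toy))
```

If the mean bias varies systematically across buckets, that is a signal that length (or the richer context that comes with it) is confounding the bias measurement and should be controlled for.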