The paper presents FairPair, a framework for evaluating bias in language models. FairPair operates by constructing counterfactual pairs of text continuations, where one continuation is generated from a prompt with one demographic entity (e.g., John) and the other is generated from the same prompt but with the entity perturbed to a different demographic (e.g., Jane).
Crucially, FairPair grounds the comparison in the same demographic entity, so that the evaluation is not influenced by the mere presence of different entities. It also accounts for the inherent variability of the generation process by sampling multiple continuations for each prompt.
The authors evaluate several commonly used language models using FairPair on a new dataset called Common Sents, which contains natural-sounding sentences. They find that larger models like LLaMa and InstructGPT exhibit higher bias relative to their sampling variability, indicating that the differences between the continuations cannot be fully explained by the generation process alone.
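The comparison of bias against sampling variability can be illustrated with a minimal sketch. It is not the authors' implementation: the Jaccard token distance stands in for whatever scoring function the paper uses, `swap_entity` is a hypothetical helper that grounds both continuations in the same entity by simple string replacement, and the toy continuations are invented for illustration rather than taken from any model.

```python
from itertools import combinations, product
from statistics import mean


def jaccard_distance(a: str, b: str) -> float:
    """Token-set distance in [0, 1]; an assumed stand-in scoring function."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def swap_entity(text: str, source: str, target: str) -> str:
    """Hypothetical grounding step: rewrite the entity so both sides match."""
    return text.replace(source, target)


def sampling_variability(samples: list[str]) -> float:
    """Mean distance between continuations sampled from the same prompt."""
    return mean(jaccard_distance(a, b) for a, b in combinations(samples, 2))


def bias(samples_a: list[str], samples_b: list[str]) -> float:
    """Mean distance between continuations of the two counterfactual prompts."""
    return mean(jaccard_distance(a, b) for a, b in product(samples_a, samples_b))


# Invented toy continuations, two samples per prompt.
john_samples = ["John is a skilled engineer", "John is a skilled manager"]
jane_samples = ["Jane loves her family", "Jane loves her garden"]

# Ground the "John" continuations in the "Jane" entity before comparing.
grounded = [swap_entity(s, "John", "Jane") for s in john_samples]

b = bias(grounded, jane_samples)
v = mean([sampling_variability(grounded), sampling_variability(jane_samples)])
print(f"bias={b:.3f} variability={v:.3f}")
```

In this toy case the cross-prompt distance exceeds the within-prompt distance, which is the kind of gap the summary describes as bias that sampling alone does not explain.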
Qualitative analysis of the prevalent n-grams in the continuations also reveals differential treatment: continuations of prompts starting with "John" tend to discuss occupational capabilities, while continuations of prompts starting with "Jane" tend to discuss family, hobbies, and personality traits.