Core Concepts
Gender disparities persist in LLM responses despite model advancements, affecting both factuality and fairness.
Abstract
This study evaluates the use of Large Language Models (LLMs) for retrieving factual information, focusing on gender-based biases in responses. Findings reveal discernible gender disparities in GPT-3.5 responses; GPT-4 shows improvement but does not fully eliminate them. The study explores factors influencing these disparities, including associations with industries and company names. A new fairness metric, RCS, shows GPT-4's improved fairness over GPT-3.5. Gender differences persist in recall rates, declination patterns, and hallucinated names, highlighting ongoing challenges in LLM performance.
Stats
"Our findings reveal discernible gender disparities in the responses generated by GPT-3.5."
"GPT-4 has led to improvements but has not fully eradicated these gender disparities."
"Female Nobel Prize winners were significantly more likely to be recalled than male Nobel Prize winners."