This study presents an empirical evaluation of the fairness of Large Language Models (LLMs) when used as text rankers. The authors focus on assessing the representation of binary protected attributes, such as gender and geographic location, in the ranking outcomes of these models.
The study utilizes the TREC Fair Ranking dataset, which contains queries and associated documents annotated with protected attributes. The authors conduct both listwise and pairwise evaluations to assess fairness from different perspectives:
Listwise Evaluation: The model ranks a full list of candidate documents for each query, and the resulting rankings are examined for how equitably protected and non-protected groups are represented, alongside standard ranking effectiveness.
Pairwise Evaluation: The model is asked to order pairs of documents in which one document carries the protected attribute and the other does not; fairness is measured as the ratio of outcomes favoring each group, where a ratio of 1.0 indicates equal treatment (see the sketch after this list).
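To make the two evaluation perspectives concrete, here is a minimal sketch in Python of how a listwise exposure ratio and a pairwise fairness ratio could be computed. The function names, the logarithmic position discount, and the data layout are illustrative assumptions, not the paper's exact metric definitions.

```python
import numpy as np

def listwise_exposure_ratio(ranking, protected, k=None):
    """Mean rank-discounted exposure of protected documents divided by
    that of non-protected documents in one ranked list (1.0 = parity)."""
    if k is not None:
        ranking = ranking[:k]
    # Logarithmic position discount: higher ranks receive more exposure.
    exposures = [1.0 / np.log2(pos + 2) for pos in range(len(ranking))]
    prot = [e for e, doc in zip(exposures, ranking) if doc in protected]
    nonprot = [e for e, doc in zip(exposures, ranking) if doc not in protected]
    if not prot or not nonprot:
        return float("nan")  # ratio undefined if either group is absent
    return float(np.mean(prot) / np.mean(nonprot))

def pairwise_fairness_ratio(pair_outcomes):
    """Wins for protected documents divided by wins for non-protected
    documents over protected/non-protected pairs (1.0 = equal treatment)."""
    pair_outcomes = list(pair_outcomes)  # True if the protected doc won the pair
    wins_protected = sum(pair_outcomes)
    wins_nonprotected = len(pair_outcomes) - wins_protected
    if wins_nonprotected == 0:
        return float("inf")
    return wins_protected / wins_nonprotected

# Example: protected documents sit lower in the list, so both ratios < 1.0.
ranking = ["d3", "d1", "d4", "d2"]
protected = {"d1", "d2"}
print(listwise_exposure_ratio(ranking, protected))           # ~0.71
print(pairwise_fairness_ratio([True, False, False, False]))  # ~0.33
```

In both cases, values below 1.0 indicate that the ranker systematically favors the non-protected group, which is the kind of imbalance the paper's pairwise evaluation surfaces.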
To address the fairness issues observed in the pairwise evaluation, the authors fine-tune the Mistral-7B model using Low-Rank Adaptation (LoRA). The results show that the LoRA-adjusted model achieves fairness ratios closer to the ideal benchmark of 1.0, indicating more equitable treatment of protected and non-protected groups.
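The summary does not include the authors' training code, but attaching a LoRA adapter to Mistral-7B is commonly done with the Hugging Face peft library; the sketch below shows the general pattern. The checkpoint name and all hyperparameters (rank, alpha, target modules) are illustrative defaults, not the authors' settings.

```python
# pip install transformers peft
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# LoRA freezes the base weights and trains small low-rank adapter
# matrices injected into the attention projections. These hyperparameters
# are illustrative defaults, not the paper's configuration.
config = LoraConfig(
    r=8,                                    # rank of the adapter matrices
    lora_alpha=16,                          # scaling factor for the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # which projections get adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only a small fraction of weights is trainable
```

Because only the low-rank adapter weights are updated, this kind of fairness-oriented fine-tuning is far cheaper than full fine-tuning of a 7B-parameter model.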
The study highlights the importance of considering fairness alongside effectiveness when evaluating LLMs as rankers, and provides a comprehensive benchmark for assessing the fairness of these models.
Key insights distilled from: Yuan Wang, Xu..., arxiv.org, 04-05-2024, https://arxiv.org/pdf/2404.03192.pdf