Large Language Models (LLMs) exhibit biases in their ranking outcomes, underrepresenting historically marginalized groups. This study empirically evaluates the fairness of popular LLMs, including GPT-3.5, GPT-4, Llama2-13b, and Mistral-7b, as text rankers, using both listwise and pairwise evaluation methods.
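To illustrate the two evaluation modes, the sketch below builds a listwise prompt that asks the model to order all candidates at once and a pairwise prompt that asks it to compare two candidates; the prompt wording and the `query`/`candidates` inputs are illustrative assumptions, not the study's actual prompts.

```python
# Hypothetical sketch of listwise vs. pairwise ranking prompts for an LLM ranker.
# Prompt wording and example inputs are assumptions, not the study's own setup.

def listwise_prompt(query: str, candidates: list[str]) -> str:
    """Ask the model to order every candidate at once (listwise)."""
    numbered = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(candidates))
    return (
        "Rank the following passages by relevance to the query.\n"
        f"Query: {query}\n{numbered}\n"
        "Answer with the passage numbers, most relevant first."
    )

def pairwise_prompt(query: str, cand_a: str, cand_b: str) -> str:
    """Ask the model to compare exactly two candidates (pairwise)."""
    return (
        f"Query: {query}\n"
        f"Passage A: {cand_a}\nPassage B: {cand_b}\n"
        "Which passage is more relevant to the query? Answer 'A' or 'B'."
    )

if __name__ == "__main__":
    query = "effects of remote work on productivity"
    candidates = [
        "Remote work improves focus for many employees...",
        "Open office layouts and noise levels...",
        "Hybrid schedules balance collaboration and autonomy...",
    ]
    print(listwise_prompt(query, candidates))
    print(pairwise_prompt(query, candidates[0], candidates[1]))
```

A fairness evaluation would then compare, across many such prompts, how often items associated with different demographic groups end up near the top of the model's orderings.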
Large language models struggle to rank even a small set of items effectively when the ranking must satisfy multiple, potentially conflicting conditions. EXSIR, a novel decomposed reasoning approach, significantly improves LLM performance on this task.
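As a generic illustration of decomposed, condition-by-condition ranking (not EXSIR's actual algorithm), the sketch below scores each item under each condition separately and then aggregates the per-condition ranks; the `score` callback, which stands in for an LLM judgment, and the rank-sum aggregation are assumptions for this example.

```python
# Generic sketch of decomposed ranking over multiple conditions.
# NOT the EXSIR algorithm itself; the scorer and the rank-sum aggregation
# are illustrative assumptions.

from typing import Callable

def decomposed_rank(
    items: list[str],
    conditions: list[str],
    score: Callable[[str, str], float],
) -> list[str]:
    """Rank items by scoring each one against each condition in isolation.

    `score(item, condition)` stands in for an LLM call that judges how well
    a single item satisfies a single condition.
    """
    totals = {item: 0.0 for item in items}
    for cond in conditions:
        # Order items under this one condition only.
        ordered = sorted(items, key=lambda it: score(it, cond), reverse=True)
        for rank, item in enumerate(ordered):
            totals[item] += rank  # lower aggregate rank = better overall
    return sorted(items, key=lambda it: totals[it])

if __name__ == "__main__":
    items = ["laptop A", "laptop B", "laptop C"]
    conditions = ["under $1000", "battery over 10 hours", "weighs under 1.5 kg"]
    # Toy scores; a real pipeline would query an LLM per (item, condition) pair.
    toy = {
        ("laptop A", conditions[0]): 0.9, ("laptop A", conditions[1]): 0.2, ("laptop A", conditions[2]): 0.7,
        ("laptop B", conditions[0]): 0.4, ("laptop B", conditions[1]): 0.8, ("laptop B", conditions[2]): 0.6,
        ("laptop C", conditions[0]): 0.6, ("laptop C", conditions[1]): 0.5, ("laptop C", conditions[2]): 0.9,
    }
    print(decomposed_rank(items, conditions, lambda it, c: toy[(it, c)]))
```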