This paper investigates the fairness of abstractive summarization by large language models (LLMs). The authors first formally define fairness in abstractive summarization as not underrepresenting the perspective of any group of people. They then propose four reference-free automatic metrics that measure fairness by comparing the perspective distribution of the generated summary against that of the source documents.
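To make the metric idea concrete, here is a minimal sketch, not the paper's exact formulation: it assumes each source document and each summary sentence carries a perspective label (e.g., a sentiment or group tag), builds the two label distributions, and scores fairness as one minus their Jensen-Shannon divergence. The function names and the choice of divergence are illustrative assumptions.

```python
import math
from collections import Counter

def perspective_distribution(labels):
    """Normalize counts of perspective labels (e.g., 'pos'/'neg', or
    demographic group tags) into a probability distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def js_divergence(p, q):
    """Symmetric Jensen-Shannon divergence between two label distributions."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a, b):
        return sum(a[k] * math.log2(a[k] / b[k])
                   for k in keys if a.get(k, 0.0) > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def fairness_score(source_labels, summary_labels):
    """Higher is fairer: 1 minus the divergence between the perspective
    distribution of the source documents and that of the summary."""
    p = perspective_distribution(source_labels)
    q = perspective_distribution(summary_labels)
    return 1.0 - js_divergence(p, q)

# Example: source reviews are 60% positive / 40% negative, but the summary
# only covers positive sentences, so the score drops below 1.
source = ["pos"] * 6 + ["neg"] * 4
summary = ["pos"] * 3
print(round(fairness_score(source, summary), 3))
```

A divergence-based score like this is reference-free in the sense that it needs no gold summary, only perspective labels for the source and the system output.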
The authors evaluate nine LLMs, including GPT, LLaMA, PaLM 2, and Claude, on six datasets covering diverse domains such as social media, online reviews, and recorded transcripts. The results show that both human-written reference summaries and LLM-generated summaries often suffer from low fairness.
The authors conduct a comprehensive analysis to identify common factors that influence fairness, such as decoding temperature and summary length. They also propose three simple but effective methods to improve fairness: changing the decoding temperature, adjusting the summary length, and appending the definition of fairness to the instruction prompt.
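As a concrete illustration of the prompt-based mitigation, the following sketch appends a fairness definition to a summarization instruction; the wording of the definition, the template, and the parameter names are assumptions, not the paper's exact prompt.

```python
# Hypothetical prompt template for the "append the fairness definition"
# mitigation. Temperature and summary length would be controlled separately
# via the model's generation parameters.
FAIRNESS_DEFINITION = (
    "A fair summary does not underrepresent the perspective of any group "
    "of people; it should reflect the different opinions in the source "
    "documents in roughly the same proportion as they appear."
)

def build_prompt(documents, max_words=100):
    """Build a summarization prompt with the fairness definition appended."""
    joined = "\n\n".join(documents)
    return (
        f"Summarize the following documents in at most {max_words} words.\n"
        f"{FAIRNESS_DEFINITION}\n\n{joined}\n\nSummary:"
    )

docs = [
    "Review A: The battery life is excellent and the camera is sharp.",
    "Review B: The screen cracked within a week and support was unhelpful.",
]
print(build_prompt(docs, max_words=80))
```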
Source: Yusen Zhang et al., arxiv.org, 04-02-2024, https://arxiv.org/pdf/2311.07884.pdf