This study proposes a pipeline that uses large language models (LLMs) such as GPT-3.5 and GPT-4 as synthetic experts to generate high-quality synthetic edit feedback. This feedback is then used to align weaker language models (LMs), such as GPT-2 and Llama-2, toward factual accuracy in clinical summarization.
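As a concrete illustration of the feedback-generation step, below is a minimal sketch of an LLM-as-synthetic-expert call; the prompt wording, model choice, and the `synthetic_edit` helper are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch: asking an LLM "synthetic expert" to edit a weak LM's
# draft summary. The prompt text and model name are assumptions made
# for illustration, not the paper's exact configuration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EDIT_PROMPT = (
    "You are a clinical documentation expert. Edit the draft summary so "
    "that every statement is supported by the source note. Remove or "
    "correct hallucinated facts and keep the rest of the wording intact.\n\n"
    "Source note:\n{note}\n\nDraft summary:\n{draft}\n\nEdited summary:"
)

def synthetic_edit(note: str, draft: str, model: str = "gpt-4") -> str:
    """Return a factually improved edit of `draft`; the (draft, edit)
    pair can then serve as (dispreferred, preferred) alignment data."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": EDIT_PROMPT.format(note=note, draft=draft)}],
        temperature=0.0,  # deterministic edits for reproducible feedback
    )
    return response.choices[0].message.content
```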
The key highlights are:
The synthetic edit feedback is generated in two directions, yielding paired higher-quality and lower-quality summaries that serve as the alignment signal.
The generated synthetic edit feedback is used to train the weaker LMs with two alignment algorithms: DPO (Direct Preference Optimization) and SALT (Sequence Alignment (un)Likelihood Training), both sketched below after the highlights.
Experiments show that the synthetic edit feedback significantly improves the factual accuracy of summaries generated by the weaker LMs. For Llama-2, DPO led to a 2.44% increase in ROUGE-L and a 1.35% increase in factuality, while SALT resulted in a 2.47% increase in ROUGE-L and a 2.04% increase in factuality.
For GPT-2, DPO led to a 3.04% increase in ROUGE-L and a 2.93% increase in factuality, while SALT yielded a 4.04% increase in ROUGE-L and a 4.64% increase in factuality.
The top-performing model achieved a 78% preference rate for factuality among human evaluators.
The study demonstrates the substantial potential of LLM-generated synthetic edits for improving the factual alignment of clinical summarization models, addressing the critical challenge of hallucinations in generative AI.
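For concreteness, here is a hedged sketch of the two alignment objectives named above, in their commonly published forms; tensor shapes, the SALT token-alignment mask, and all hyperparameters are simplifying assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization over summed sequence log-probs.

    Here the LLM-edited summary plays the role of `chosen` and the weak
    LM's original draft the role of `rejected`.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

def salt_loss(token_logp: torch.Tensor,
              keep_mask: torch.Tensor,
              alpha: float = 1.0) -> torch.Tensor:
    """SALT-style likelihood/unlikelihood objective (simplified).

    token_logp: per-token log-probs of the target sequence, shape [B, T].
    keep_mask:  1.0 for tokens retained or added by the expert edit
                (trained with likelihood), 0.0 for tokens the edit
                removed (trained with unlikelihood, pushing their
                probability down).
    """
    likelihood = -token_logp * keep_mask
    unlikelihood = -torch.log1p(-token_logp.exp() + 1e-8) * (1.0 - keep_mask)
    return (likelihood + alpha * unlikelihood).mean()
```

ROUGE-L scores like those reported in the highlights are typically computed with the standard rouge-score package; this usage is generic rather than the paper's evaluation code.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = scorer.score("reference clinical summary", "generated clinical summary")
print(scores["rougeL"].fmeasure)  # F1 over the longest common subsequence
```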
Source: Prakamya Mishra et al., https://arxiv.org/pdf/2402.13919.pdf (arxiv.org, 04-18-2024)