The content discusses the importance of correctly quantifying the uncertainty of language models (LMs), which often generate incorrect or hallucinated responses. While various uncertainty measures have been proposed, such as semantic entropy, affinity-graph-based measures, and verbalized confidence, they differ greatly in their output ranges, making it unclear how to compare them.
The authors introduce a novel framework, termed Rank-Calibration, to assess the quality of uncertainty and confidence measures for LMs. The key idea is that lower uncertainty (or higher confidence) should imply higher generation quality, on average. The Rank-Calibration Error (RCE) is proposed as a metric to quantify deviations from this ideal relationship, without requiring ad hoc binary thresholding of the correctness score.
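To make the rank-calibration idea concrete, the sketch below computes a rough empirical rank-calibration error from paired uncertainty values and correctness scores. This is an illustrative approximation, not the paper's exact estimator: the quantile binning, the bin-mean estimate of expected correctness, and the function name `empirical_rank_calibration_error` with its `num_bins` parameter are assumptions made for this example.

```python
import numpy as np

def empirical_rank_calibration_error(uncertainties, correctness, num_bins=20):
    """Rough empirical approximation of a rank-calibration error (illustrative).

    Ideal rank-calibration: generations with lower uncertainty should have
    higher expected correctness. We bin generations by uncertainty, estimate
    the expected correctness per bin by its mean, and measure how far the
    uncertainty ranking departs from the reversed ranking of expected correctness.
    """
    uncertainties = np.asarray(uncertainties, dtype=float)
    correctness = np.asarray(correctness, dtype=float)

    # Quantile binning so each uncertainty bin carries roughly equal mass.
    edges = np.quantile(uncertainties, np.linspace(0.0, 1.0, num_bins + 1))
    bin_ids = np.clip(
        np.searchsorted(edges, uncertainties, side="right") - 1, 0, num_bins - 1
    )

    # Expected correctness per bin (bins without samples are dropped).
    means = np.array([
        correctness[bin_ids == b].mean() if np.any(bin_ids == b) else np.nan
        for b in range(num_bins)
    ])
    means = means[~np.isnan(means)]
    n = len(means)

    # Bins are already ordered by uncertainty; compare the normalized
    # uncertainty rank with one minus the normalized correctness rank.
    unc_rank = np.arange(n) / max(n - 1, 1)
    qual_rank = means.argsort().argsort() / max(n - 1, 1)
    return float(np.mean(np.abs(unc_rank - (1.0 - qual_rank))))


# Example: a well rank-calibrated measure (correctness falls as uncertainty
# rises) should score near 0; a reversed one should score near 1.
rng = np.random.default_rng(0)
u = rng.uniform(size=5000)
good = (rng.uniform(size=5000) < 1.0 - u).astype(float)  # quality falls with u
bad = (rng.uniform(size=5000) < u).astype(float)         # quality rises with u
print(empirical_rank_calibration_error(u, good))  # small
print(empirical_rank_calibration_error(u, bad))   # large
```

In this idealized setup, a measure whose bin-wise uncertainty ordering exactly reverses the ordering of expected correctness yields an error of 0, while larger values indicate departures from the "lower uncertainty implies higher quality" relationship; note that no binary threshold on the correctness score is needed.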
The authors demonstrate the broader applicability and granular interpretability of their methods through experiments on various datasets and language models, including Llama-2-7b, Llama-2-7b-chat, and GPT-3.5-turbo. They also conduct comprehensive ablation studies to examine the robustness of their assessment framework.
Key insights extracted from Xinmeng Huan... at arxiv.org, 04-05-2024: https://arxiv.org/pdf/2404.03163.pdf