The content discusses the importance of correctly quantifying the uncertainty of language models (LMs), as they often generate incorrect or hallucinated responses. Various uncertainty measures have been proposed, such as semantic entropy, affinity-graph-based measures, and verbalized confidence, but they differ greatly in their output ranges, making it unclear how to compare them.
The authors introduce a novel framework, termed Rank-Calibration, to assess the quality of uncertainty and confidence measures for LMs. The key idea is that lower uncertainty (or higher confidence) should imply higher generation quality, on average. The Rank-Calibration Error (RCE) is proposed as a metric to quantify deviations from this ideal relationship, without requiring ad hoc binary thresholding of the correctness score.
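To make the rank-calibration idea concrete, the following is a minimal sketch of a binned empirical estimator: prompts are grouped into equal-mass uncertainty bins, the mean generation quality is estimated within each bin, and the deviation between the resulting quality ranking and the ideal reversed uncertainty ranking is averaged. The function name, the equal-mass binning, and the normalization are illustrative assumptions, not the authors' exact RCE estimator.

```python
import numpy as np

def empirical_rce(uncertainty, quality, num_bins=20):
    """Sketch of a binned rank-calibration error estimate (illustrative, not the paper's exact estimator).

    uncertainty : per-prompt uncertainty scores U(x_i)
    quality     : per-prompt generation-quality scores A(x_i), e.g. in [0, 1]
    """
    u = np.asarray(uncertainty, dtype=float)
    a = np.asarray(quality, dtype=float)

    # Sort prompts by uncertainty and split into equal-mass bins
    # (bin 0 holds the lowest-uncertainty prompts).
    order = np.argsort(u)
    bins = np.array_split(order, num_bins)

    # Mean quality per bin, an estimate of E[A | U in bin b].
    bin_quality = np.array([a[idx].mean() for idx in bins])

    # Ideal rank-calibration: the lowest-uncertainty bin should have the
    # highest quality, so quality ranks should exactly reverse uncertainty ranks.
    quality_rank = bin_quality.argsort().argsort()   # 0 = lowest mean quality
    ideal_rank = np.arange(num_bins)[::-1]           # low uncertainty -> high quality
    return np.abs(quality_rank - ideal_rank).mean() / (num_bins - 1)
```

Under this sketch, an uncertainty measure whose ranking perfectly reverses the quality ranking yields a value near 0, while one that is uninformative about quality yields a larger value; no binary correctness threshold is needed, since only ranks of the binned quality estimates enter the computation.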
The authors demonstrate the broader applicability and granular interpretability of their methods through experiments on various datasets and language models, including Llama-2-7b, Llama-2-7b-chat, and GPT-3.5-turbo. They also conduct comprehensive ablation studies to examine the robustness of their assessment framework.