Language models can be fine-tuned to generate well-calibrated linguistic expressions of uncertainty that accurately reflect the likelihood of their predictions being correct.
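The notion of calibration here can be made concrete with a standard metric. The sketch below is a minimal, hypothetical illustration: it assumes verbalized confidences (e.g. "I'm about 80% sure") have already been parsed to numbers in [0, 1], and computes the expected calibration error (ECE) by binning predictions by stated confidence and comparing each bin's average confidence to its empirical accuracy. The data at the bottom is illustrative, not from any experiment.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by stated confidence, then take the
    sample-weighted average gap between each bin's mean confidence
    and its empirical accuracy."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # clamp c == 1.0 into last bin
        bins[idx].append((c, y))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

# Toy data: the model states 0.8 confidence and is correct 8 times out of 10,
# so its verbalized uncertainty is perfectly calibrated (ECE near 0).
confs = [0.8] * 10
labels = [1] * 8 + [0] * 2
print(round(expected_calibration_error(confs, labels), 3))
```

A fine-tuning objective targeting well-calibrated verbalized uncertainty would aim to drive a metric like this toward zero on held-out data.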
Uncertainty measures for language models should be evaluated based on their ability to accurately reflect the expected correctness of generated outputs, without relying on ad hoc thresholding of correctness scores.
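One threshold-free way to carry out such an evaluation is to check whether the uncertainty measure ranks outputs consistently with a graded correctness score (e.g. a ROUGE or human rating in [0, 1]), rather than first binarizing correctness at an arbitrary cutoff. The sketch below is a hypothetical illustration using Spearman rank correlation, implemented from scratch to stay self-contained; the arrays are made-up stand-ins for model outputs.

```python
def ranks(values):
    """Average 1-based ranks, with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Pearson correlation of the rank vectors of x and y."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# A good uncertainty measure should anti-correlate with graded correctness:
# the more uncertain the model, the lower the correctness score tends to be.
uncertainty = [0.1, 0.4, 0.7, 0.9]
correctness = [0.95, 0.6, 0.3, 0.1]  # graded scores, no binarizing threshold
print(round(spearman(uncertainty, correctness), 2))
```

Because rank correlation uses the full distribution of correctness scores, no ad hoc decision about what counts as "correct enough" is needed; proper scoring rules such as the Brier score applied to expected correctness would serve the same purpose.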