The paper introduces the problem of calibration for code summarizers, framing the "correctness" of a generated summary as being sufficiently similar to a human-written reference summary.
The authors examine the calibration of several LLMs, including GPT-3.5-Turbo, CodeLlama-70b, and DeepSeek-Coder-33b, across different programming languages (Java and Python) and prompting techniques (Retrieval Augmented Few-Shot Learning and Automatic Semantic Augmentation of Prompt).
The authors find that the LLMs' own confidence measures (average token probability) are not well-aligned with the actual similarity of the generated summaries to human-written ones. However, by applying Platt scaling, the authors are able to significantly improve the calibration of the LLMs, achieving high skill scores and low Brier scores.
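To make the rescaling step concrete, here is a minimal sketch of Platt scaling applied to summary-level confidences. It assumes we already have, for each generated summary, the model's average token probability and a continuous similarity score against the human reference; the synthetic data, the similarity threshold of 0.5, and the use of scikit-learn's `LogisticRegression` are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: Platt scaling of raw confidences, plus Brier and skill scores.
# All data below is synthetic; threshold and metric choices are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)

# Hypothetical per-summary data: raw confidence (average token probability)
# and a continuous similarity score against the human-written reference.
avg_token_prob = rng.uniform(0.3, 1.0, size=500)
similarity = np.clip(avg_token_prob * 0.6 + rng.normal(0, 0.2, 500), 0, 1)

# Binarize "correctness": a summary counts as correct if it is
# sufficiently similar to the human-written one (threshold is illustrative).
correct = (similarity >= 0.5).astype(int)

# Platt scaling: a one-feature logistic regression mapping raw confidence
# to a calibrated probability of correctness. In practice this would be
# fit on a held-out calibration split rather than the evaluation data.
platt = LogisticRegression()
platt.fit(avg_token_prob.reshape(-1, 1), correct)
calibrated = platt.predict_proba(avg_token_prob.reshape(-1, 1))[:, 1]

# Brier score (lower is better) before and after rescaling.
print("Brier, raw confidence:", brier_score_loss(correct, avg_token_prob))
print("Brier, Platt-scaled:  ", brier_score_loss(correct, calibrated))

# Skill score relative to a constant baseline that always predicts the
# base rate of correct summaries (higher is better, 0 = no skill).
base_rate = correct.mean()
brier_ref = brier_score_loss(correct, np.full_like(calibrated, base_rate))
skill = 1.0 - brier_score_loss(correct, calibrated) / brier_ref
print("Brier skill score (Platt):", skill)
```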
By contrast, the LLMs' self-reflective confidence measures (logit-based and verbalized) remain poorly calibrated even after rescaling. The paper highlights the challenges of calibration for generative tasks such as code summarization, where the evaluation metrics are continuous rather than binary.
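Judging whether any of these confidence measures is "well-calibrated" requires comparing stated confidence to observed correctness. A common way to do this is expected calibration error over confidence bins; the helper below is a generic sketch (the bin count is an arbitrary choice, and ECE is not necessarily the metric the paper reports), applicable to raw, Platt-scaled, or self-reflective confidences once correctness has been binarized.

```python
# Sketch: expected calibration error (ECE) for any confidence measure
# against binarized correctness. Bin count is an illustrative assumption.
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """Average |empirical accuracy - mean confidence| over equal-width bins,
    weighted by the fraction of examples falling in each bin."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidence > lo) & (confidence <= hi)
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidence[mask].mean())
        ece += (mask.sum() / len(confidence)) * gap
    return ece
```

A well-calibrated measure yields a small ECE: among summaries assigned, say, 80% confidence, roughly 80% should actually clear the similarity threshold.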