The paper proposes a pipeline for quantitatively evaluating interactive large language models (LLMs) such as ChatGPT on the publicly available Big-Vul dataset. The authors carry out an extensive technical evaluation of ChatGPT across five common software vulnerability tasks (a minimal sketch of the underlying prompting setup follows the list):
Vulnerability Detection: The study finds that existing state-of-the-art methods generally outperform ChatGPT in software vulnerability detection, though ChatGPT excels at detecting certain vulnerability types like null pointer and access control issues. However, ChatGPT can be easily induced to change its vulnerability classifications, indicating low confidence.
Vulnerability Assessment: ChatGPT has limited capacity for assessing vulnerability severity based on source code alone, but its performance improves significantly when provided with additional context information. However, its accuracy in predicting severity ratings varies across different vulnerability types.
Vulnerability Localization: ChatGPT shows some capability in locating vulnerabilities, but its performance varies across vulnerability types.
Vulnerability Repair: ChatGPT shows limited ability to repair vulnerabilities, with or without context information. What repair capability it does exhibit can be attributed to its ability to recognize and understand specific vulnerability patterns.
Vulnerability Description: ChatGPT performs unevenly when generating CVE descriptions across vulnerability types, with limited accuracy in providing detailed information. Its description quality is affected by factors such as training data, complexity, and language specificity.
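To make the evaluation pipeline concrete, here is a minimal sketch of how a single vulnerability-detection query might be issued, assuming the OpenAI Python SDK and a function snippet drawn from Big-Vul. The model name, prompt wording, and helper function are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of one vulnerability-detection query, assuming the OpenAI
# Python SDK (v1) and an OPENAI_API_KEY set in the environment.
# Prompt wording and model name are illustrative, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()

def detect_vulnerability(func_source: str) -> str:
    """Ask the model whether a Big-Vul function contains a vulnerability."""
    prompt = (
        "Is the following C/C++ function vulnerable? "
        "Answer 'yes' or 'no', and name the CWE type if yes.\n\n"
        f"{func_source}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model; the paper evaluates ChatGPT
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for reproducible scoring
    )
    return response.choices[0].message.content
```

In a full pipeline, such queries would be run over labeled Big-Vul samples and the responses compared against ground-truth labels to compute metrics such as accuracy or F1; the other four tasks would follow the same pattern with task-specific prompts.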
Overall, the paper provides valuable insights for further enhancing ChatGPT's software vulnerability handling capabilities.