The paper presents a comprehensive evaluation framework called SecLLMHolmes to assess the performance of LLMs in identifying and reasoning about security vulnerabilities in code. The framework tests LLMs across eight distinct dimensions, including deterministic response, faithful reasoning, robustness to code augmentations, and performance on real-world vulnerabilities.
The evaluation was conducted on eight state-of-the-art LLMs using 228 code scenarios spanning eight critical vulnerabilities in C and Python. The key findings include:
LLM performance varies widely with the model and prompting technique used; all models exhibit high false-positive rates, incorrectly flagging patched code as still vulnerable.
LLM outputs are non-deterministic, with models changing their answers across multiple runs of the same test (a minimal measurement harness is sketched at the end of this summary).
Even when LLMs correctly identify a vulnerability, the reasoning they provide is often incorrect, undermining their trustworthiness.
LLM chain-of-thought reasoning is not robust and is easily confused by simple code augmentations, such as renaming functions and variables or swapping in related library functions (see the illustrative sketch after these findings).
LLMs fail to detect vulnerabilities in real-world projects, demonstrating that further advancements are needed before they can be reliably used as security assistants.
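To make the augmentation finding concrete, here is a minimal sketch of the kind of semantics-preserving rename described above. The vulnerable snippet (a classic SQL-injection pattern) and the rename mapping are illustrative inventions, not examples from the SecLLMHolmes benchmark, and a real augmentation tool would rewrite identifiers via the AST rather than with regexes.

```python
import re

# Illustrative vulnerable snippet (not from the benchmark): unsanitized
# string formatting of user input allows SQL injection.
VULNERABLE_SNIPPET = '''
def get_user(cursor, username):
    query = "SELECT * FROM users WHERE name = '%s'" % username
    cursor.execute(query)
    return cursor.fetchone()
'''

def rename_identifiers(code: str, mapping: dict[str, str]) -> str:
    """Rename functions/variables without changing program semantics.

    Word-boundary regexes avoid clobbering longer identifiers; a real
    tool would rename via the AST so string literals stay untouched.
    """
    for old, new in mapping.items():
        code = re.sub(rf"\b{re.escape(old)}\b", new, code)
    return code

# The same vulnerability under different surface names -- the kind of
# trivial change the paper reports can flip a model's verdict.
augmented = rename_identifiers(
    VULNERABLE_SNIPPET,
    {"get_user": "fetch_record", "username": "key", "query": "stmt"},
)
print(augmented)
```

Feeding both the original and the augmented version to a model and comparing its verdicts is the essence of the robustness dimension the framework tests.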
The paper concludes that current LLMs are not yet ready for automated vulnerability detection tasks, and that the SecLLMHolmes framework can serve as a benchmark for evaluating progress in this domain.
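As a rough illustration of how the non-determinism finding could be measured, the harness below queries a model repeatedly with an identical prompt and reports answer agreement. Note that `ask_llm` is a hypothetical stand-in (simulated here with random flips so the sketch runs end to end), not SecLLMHolmes code, and the yes/no parsing is deliberately naive.

```python
import random
from collections import Counter

def ask_llm(prompt: str) -> str:
    """Stand-in for a real chat-completion call.

    Simulated with random flips so the harness runs as-is; swap in an
    actual API client to evaluate a real model.
    """
    return random.choice(["Yes, this code is vulnerable.",
                          "No, this code is not vulnerable."])

def extract_verdict(answer: str) -> str:
    """Naive verdict extraction; a real harness would parse more robustly."""
    return "vulnerable" if answer.lower().startswith("yes") else "not vulnerable"

def determinism_rate(prompt: str, runs: int = 10) -> float:
    """Fraction of runs that agree with the most common verdict.

    1.0 means the model answered identically on every run; lower values
    correspond to the answer-flipping the paper reports.
    """
    verdicts = Counter(extract_verdict(ask_llm(prompt)) for _ in range(runs))
    return verdicts.most_common(1)[0][1] / runs

if __name__ == "__main__":
    print(determinism_rate("Does this C function contain a buffer overflow?"))
```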
Source: Saad Ullah et al., arxiv.org, 04-16-2024, https://arxiv.org/pdf/2312.12575.pdf