The researchers collected a benchmark of 15 real-world one-day vulnerabilities from the Common Vulnerabilities and Exposures (CVE) database and academic papers. They developed an LLM agent using GPT-4 as the base model, along with a prompt, the ReAct agent framework, and access to various tools.
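The agent structure described above, a base model driven by the ReAct pattern with access to tools, can be pictured as a loop that alternates a thought, an action, and an observation. The following is a minimal sketch of that general pattern only, not the authors' implementation: `scripted_model`, `read_advisory`, and `react_loop` are hypothetical names, the "model" is a hard-coded stand-in rather than GPT-4, and the single tool returns placeholder text.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for the LLM: returns (thought, action, argument).
# A real agent would call a language model here; this one is scripted.
def scripted_model(history: List[str]) -> Tuple[str, str, str]:
    seen_observation = any(line.startswith("Observation:") for line in history)
    if not seen_observation:
        return ("I should read the advisory first.", "read_advisory", "CVE-EXAMPLE")
    return ("I have what I need.", "finish", "summary written")

# Hypothetical, harmless tool that returns placeholder text.
def read_advisory(cve_id: str) -> str:
    return f"(placeholder description for {cve_id})"

TOOLS: Dict[str, Callable[[str], str]] = {"read_advisory": read_advisory}

def react_loop(
    model: Callable[[List[str]], Tuple[str, str, str]],
    tools: Dict[str, Callable[[str], str]],
    task: str,
    max_steps: int = 5,
) -> List[str]:
    """Alternate Thought -> Action -> Observation until the model finishes."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        thought, action, arg = model(history)
        history.append(f"Thought: {thought}")
        history.append(f"Action: {action}[{arg}]")
        if action == "finish":
            return history
        # Run the chosen tool and feed its result back to the model.
        history.append(f"Observation: {tools[action](arg)}")
    history.append("Stopped: step limit reached")
    return history

trace = react_loop(scripted_model, TOOLS, "Summarize the advisory")
for line in trace:
    print(line)
```

The step limit is a design choice in this sketch, so that a model which never emits `finish` cannot loop indefinitely.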
The key findings are:
GPT-4 achieved an 87% success rate in exploiting the one-day vulnerabilities, while every other model tested (GPT-3.5 and 8 open-source LLMs) and the open-source vulnerability scanners ZAP and Metasploit had a 0% success rate.
When the CVE description was removed, GPT-4's success rate dropped to 7%, suggesting that identifying the vulnerability is more challenging than exploiting it.
Without the description, GPT-4 identified the correct vulnerability 33.3% of the time (55.6% for vulnerabilities past its knowledge cutoff date) but exploited only one of the vulnerabilities it had correctly detected.
The average cost of using GPT-4 to exploit the vulnerabilities was $3.52 per run, which is 2.8 times cheaper than estimated human labor costs.
The results demonstrate the emergent capabilities of LLM agents, specifically GPT-4, in the realm of cybersecurity and raise important questions about the widespread deployment of such powerful agents.
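The percentages above can be related back to the 15-vulnerability benchmark with simple arithmetic. The sketch below assumes each rate was computed over all 15 vulnerabilities and then rounded; the summary does not state the raw counts, so the counts it prints are inferred, not reported. The names `reported_rates` and `implied_count` are illustrative.

```python
BENCHMARK_SIZE = 15  # number of one-day vulnerabilities in the benchmark

# Rates as quoted in the summary (fractions of the benchmark).
reported_rates = {
    "GPT-4 with CVE description": 0.87,
    "GPT-4 without CVE description": 0.07,
    "correct vulnerability identified": 0.333,
}

def implied_count(rate: float, total: int = BENCHMARK_SIZE) -> int:
    """Nearest whole number of vulnerabilities consistent with a rounded rate."""
    return round(rate * total)

# Assumption: every rate is taken over the full benchmark of 15.
implied = {label: implied_count(rate) for label, rate in reported_rates.items()}

for label, count in implied.items():
    print(f"{label}: about {count} of {BENCHMARK_SIZE} ({count / BENCHMARK_SIZE:.1%})")
```

Under that assumption the rates correspond to roughly 13, 1, and 5 of the 15 vulnerabilities respectively.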
Key insights distilled from the paper by Richard Fang... at arxiv.org, 04-15-2024:
https://arxiv.org/pdf/2404.08144.pdf