The researchers collected a benchmark of 15 real-world one-day vulnerabilities (vulnerabilities that have been publicly disclosed but not yet patched in the target system) from the Common Vulnerabilities and Exposures (CVE) database and academic papers. They developed an LLM agent using GPT-4 as the base model, along with a prompt, the ReAct agent framework, and access to various tools.
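To make the agent setup concrete, below is a minimal sketch of a ReAct-style loop: the model reasons in text, emits an action, a tool executes it, and the observation is fed back. The tool set (a single shell tool), the prompt wording, and the action-parsing format are illustrative assumptions, not the paper's actual agent; agent frameworks such as LangChain provide ready-made implementations of this loop.

```python
# Minimal, illustrative ReAct-style agent loop (not the paper's agent).
# Assumptions: a single hypothetical "shell" tool and a simple
# "Action: tool[input]" / "Final Answer: ..." text protocol.
import re
import subprocess
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def run_shell(command: str) -> str:
    """Hypothetical tool: run a shell command, return stdout+stderr."""
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=60)
    return result.stdout + result.stderr

TOOLS = {"shell": run_shell}

SYSTEM_PROMPT = (
    "You are a security-testing agent. Think step by step.\n"
    "To use a tool, write exactly: Action: shell[<command>]\n"
    "When you are done, write: Final Answer: <summary>"
)

def react_loop(task: str, max_steps: int = 15) -> str:
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = client.chat.completions.create(
            model="gpt-4", messages=messages,
        ).choices[0].message.content
        messages.append({"role": "assistant", "content": reply})
        if "Final Answer:" in reply:  # agent decided it is finished
            return reply.split("Final Answer:", 1)[1].strip()
        match = re.search(r"Action: (\w+)\[(.*)\]", reply, re.DOTALL)
        if match and match.group(1) in TOOLS:
            # Run the requested tool and feed the output back as the
            # next observation, continuing the reason-act-observe cycle.
            observation = TOOLS[match.group(1)](match.group(2))
            messages.append({"role": "user",
                             "content": f"Observation: {observation}"})
        else:
            messages.append({"role": "user",
                             "content": "Observation: no valid action found."})
    return "Step budget exhausted without a final answer."
```

The paper's agent had access to a richer tool set than this single shell tool; the sketch only illustrates the reason-act-observe cycle that the ReAct framework formalizes.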
The key findings are:
GPT-4 achieved an 87% success rate in exploiting the one-day vulnerabilities, while every other LLM tested (GPT-3.5 and 8 open-source models) and the open-source vulnerability scanners ZAP and Metasploit had a 0% success rate.
When the CVE description was removed, GPT-4's success rate dropped to 7%, suggesting that identifying the vulnerability is more challenging than exploiting it.
In this no-description setting, GPT-4 identified the correct vulnerability 33.3% of the time (55.6% for vulnerabilities past its knowledge cutoff date), but of the vulnerabilities it detected, it could exploit only one.
Using GPT-4 to exploit the vulnerabilities cost an average of $3.52 per run, which the authors estimate is roughly 2.8 times less than the cost of equivalent human labor.
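Taking the two figures above at face value, the implied human-labor estimate is simply 2.8 × $3.52 ≈ $9.86 per attempt; this is back-of-envelope arithmetic from the numbers in this summary, not a figure quoted from the paper.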
The results demonstrate the emergent capabilities of LLM agents, specifically GPT-4, in the realm of cybersecurity and raise important questions about the widespread deployment of such powerful agents.
Key insights from Richard Fang et al., arxiv.org, 04-15-2024: https://arxiv.org/pdf/2404.08144.pdf