Core Concepts
Large language models (LLMs) have significant potential in cybersecurity applications, but their reliability and truthfulness remain a concern. The SECURE benchmark comprehensively evaluates LLM performance in realistic cybersecurity scenarios to assess how trustworthy they are as cyber advisory tools.
Abstract
The authors introduce the SECURE (Security Extraction, Understanding & Reasoning Evaluation) benchmark to assess the performance of large language models (LLMs) in cybersecurity-related tasks. SECURE includes six datasets focused on the Industrial Control System (ICS) sector, covering knowledge extraction, understanding, and reasoning based on industry-standard sources like MITRE, CVE, and CISA.
The key highlights of the SECURE benchmark are:
- Knowledge Extraction Tasks:
  - MAET (MITRE ATT&CK Extraction Task) and CWET (Common Weakness Extraction Task) evaluate the ability of LLMs to accurately recall facts from the MITRE ATT&CK and CWE databases.
- Knowledge Understanding Tasks:
  - KCV (Knowledge test on Common Vulnerabilities) assesses the LLMs' comprehension of newly introduced CVEs.
  - VOOD (Vulnerability Out-of-Distribution task) evaluates the models' ability to recognize when they lack sufficient information to answer a question.
- Knowledge Reasoning Tasks:
  - RERT (Risk Evaluation Reasoning Task) measures the LLMs' ability to summarize and reason about complex cybersecurity advisory reports from CISA.
  - CPST (CVSS Problem Solving Task) tests the models' problem-solving skills in computing CVSS scores (a worked example of this scoring arithmetic follows the list).
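To illustrate the kind of arithmetic CPST probes, here is a minimal Python sketch of the CVSS 3.1 base-score formula as defined in the FIRST.org specification; it is an illustration of the scoring math only, not code from the SECURE framework.

```python
import math

# CVSS 3.1 base-metric weights (FIRST.org specification, Table 16).
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2}
AC = {"L": 0.77, "H": 0.44}
PR_UNCHANGED = {"N": 0.85, "L": 0.62, "H": 0.27}
PR_CHANGED   = {"N": 0.85, "L": 0.68, "H": 0.50}
UI = {"N": 0.85, "R": 0.62}
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}

def roundup(x: float) -> float:
    """Round up to one decimal place, per CVSS 3.1 Appendix A."""
    i = int(round(x * 100000))
    return i / 100000.0 if i % 10000 == 0 else (math.floor(i / 10000) + 1) / 10.0

def cvss31_base_score(vector: str) -> float:
    """Compute the CVSS 3.1 base score from a vector string such as
    'CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H'."""
    m = dict(part.split(":") for part in vector.split("/")[1:])  # skip the 'CVSS:3.1' prefix
    changed = m["S"] == "C"
    pr = (PR_CHANGED if changed else PR_UNCHANGED)[m["PR"]]

    iss = 1 - (1 - CIA[m["C"]]) * (1 - CIA[m["I"]]) * (1 - CIA[m["A"]])
    impact = (7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
              if changed else 6.42 * iss)
    exploitability = 8.22 * AV[m["AV"]] * AC[m["AC"]] * pr * UI[m["UI"]]

    if impact <= 0:
        return 0.0
    raw = impact + exploitability
    return roundup(min(1.08 * raw if changed else raw, 10.0))

# Example: network-exploitable, low-complexity, high confidentiality/integrity/availability impact.
print(cvss31_base_score("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"))  # 9.8
```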
The authors evaluate seven state-of-the-art LLMs, including both open-source (Llama3-70B, Llama3-8B, Mistral-7B, Mixtral-8x7b) and closed-source (ChatGPT-3.5, ChatGPT-4, Gemini-Pro) models, on these tasks. The results show that while LLMs demonstrate some capability in cybersecurity tasks, their use as cyber advisory tools requires careful consideration due to hallucinations, lapses in truthfulness, and weak out-of-distribution performance.
The authors provide insights and recommendations to enhance the usability of LLMs in cybersecurity applications. They also release the SECURE benchmark datasets and framework for the security community to evaluate future LLMs.
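As a rough illustration of how a released benchmark like this is typically consumed, the hypothetical snippet below scores a model's answers on a file of multiple-choice items; the dataset fields, the `query_model` stub, and the function name are assumptions for illustration, not the actual SECURE framework API.

```python
import json
from typing import Callable

def evaluate_mcq(dataset_path: str, query_model: Callable[[str], str]) -> float:
    """Score a model on a JSON list of multiple-choice items of the
    (assumed) form {"question": ..., "choices": [...], "answer": "B"}."""
    with open(dataset_path) as f:
        items = json.load(f)

    correct = 0
    for item in items:
        # Present the question with lettered options, then keep the model's leading letter.
        prompt = item["question"] + "\n" + "\n".join(
            f"{label}. {choice}" for label, choice in zip("ABCD", item["choices"])
        )
        prediction = query_model(prompt).strip().upper()[:1]
        correct += prediction == item["answer"]
    return correct / len(items)
```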
Stats
The SECURE benchmark includes a total of 2036 multiple-choice questions in the MAET and CWET datasets.
The KCV and VOOD datasets contain 466 boolean questions each, based on CVEs published in 2024.
The RERT dataset includes 1000 samples from CISA security advisories, and the CPST dataset has 100 CVSS 3.1 vector strings.
Quotes
"Recent breakthroughs in large language models (LLM) like OpenAI's ChatGPT [6] have opened up their applications in many domains, including security [66]."
"Despite these advancements, there remains a significant gap in the evaluation of LLMs specifically tailored for security industries such as information security, network security, and critical infrastructure protection."
"To address this gap, we introduce a comprehensive benchmarking framework encompassing real-world cybersecurity scenarios, practical tasks, and applied knowledge assessments."