Key Concepts
Fine-tuned ViT models outperform prompt-engineered LMMs in cybersecurity tasks.
Summary
This study compares the performance of prompt-engineered Large Multimodal Models (LMMs) and fine-tuned Vision Transformer (ViT) models in two cybersecurity tasks: trigger detection and malware classification. The results show that ViT models excel in both tasks, achieving perfect accuracy in trigger detection and high accuracy in malware classification. In contrast, LMMs, despite prompt engineering, struggle in tasks requiring detailed visual comprehension and pattern recognition. The study highlights the superior effectiveness of fine-tuned ViTs in handling diverse cybersecurity challenges.
Contents:
Abstract
Large Multimodal Models (LMMs) and Vision Transformer (ViT) models compared in cybersecurity tasks.
Introduction
Rise of LMMs and ViTs in processing text and images.
Background and Preliminaries
LMMs, ViT models, prompt engineering, and fine-tuning explained.
Methodology
Prompt engineering for LMMs and fine-tuning ViT models detailed.
Experiments
Experimental setup, datasets, models used, and evaluation metrics described.
Task 1: Trigger Detection
Prompt engineering with Gemini-pro and fine-tuning ViT models for trigger detection.
Task 2: Malware Classification
Prompt engineering with Gemini-pro and fine-tuning ViT models for malware classification.
Discussion
Comparative analysis of LMMs and ViTs in cybersecurity tasks.
Conclusion
ViT models outperform LMMs in cybersecurity tasks.
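The fine-tuning approach outlined in the Methodology section — adapting a Vision Transformer with a classification head for the 25-class malware task — can be sketched in PyTorch. This is a minimal illustrative model, not the paper's actual architecture; all sizes (image resolution, patch size, embedding dimension, depth) are assumptions chosen to keep the example small, and a real setup would start from pretrained ViT weights rather than random initialization.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT-style classifier (illustrative sketch, not the paper's model)."""
    def __init__(self, image_size=32, patch=8, dim=64, depth=2, heads=4, num_classes=25):
        super().__init__()
        n_patches = (image_size // patch) ** 2
        # Patch embedding: split the image into patches and project each to `dim`
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Classification head, e.g. 25 outputs for the 25 malware types
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, n_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)       # prepend [CLS] token
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])                            # classify from [CLS]

model = TinyViT()
logits = model(torch.randn(2, 3, 32, 32))
print(logits.shape)  # torch.Size([2, 25])
```

Fine-tuning would then train this model (or, in practice, a pretrained ViT with only the head replaced) with a standard cross-entropy loss over labeled malware images.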
Statistics
The fine-tuned ViT models achieved perfect accuracy in trigger detection.
ViT models achieved an accuracy of 97.12% for classifying among 25 malware types and 98.00% for classifying among 5 malware families.
Quotes
"The ViT models, on the other hand, demonstrate exceptional accuracy, achieving near-perfect performance on both tasks."