Core Concepts
The author presents a phishing detection model that classifies HTML content by combining a Multilayer Perceptron (MLP) with pretrained NLP models, and reports that the combination detects phishing better than either model type alone.
Abstract
The study addresses the rise of cyber threats, particularly phishing, by introducing a detection model that combines a Multilayer Perceptron (MLP) for structured data with pretrained Natural Language Processing (NLP) models for textual analysis. The research notes that recent datasets suitable for comprehensive phishing studies are scarce, and it contributes an up-to-date dataset that reflects real-life conditions. The proposed MultiText-LP model fuses two NLP models with one MLP model and is reported to outperform existing methods at detecting phishing websites.
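The summary does not say how the NLP and MLP outputs are fused. As a rough illustration only, one common option is late fusion (soft voting), where each model emits a phishing probability and the probabilities are averaged. The function names, the equal default weights, and the 0.5 threshold below are assumptions for this sketch, not the paper's method.

```python
from typing import Optional, Sequence


def fuse_probabilities(model_probs: Sequence[float],
                       weights: Optional[Sequence[float]] = None) -> float:
    """Weighted average of per-model phishing probabilities (soft voting).

    Hypothetical helper: the paper's actual fusion mechanism may differ.
    """
    if not model_probs:
        raise ValueError("need at least one model probability")
    if weights is None:
        weights = [1.0] * len(model_probs)  # assumption: equal weights
    if len(weights) != len(model_probs):
        raise ValueError("weights and model_probs must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * w for p, w in zip(model_probs, weights)) / total


def classify(model_probs: Sequence[float], threshold: float = 0.5) -> str:
    """Label a page from the fused probability (threshold is an assumption)."""
    return "phishing" if fuse_probabilities(model_probs) >= threshold else "benign"


# Example: two NLP models and one MLP each score the same page.
print(classify([0.9, 0.8, 0.4]))
```

A fusion layer trained on concatenated model outputs is another plausible design; averaging is shown here only because it is the simplest to read.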
The paper describes the methodology: building the dataset from benign and phishing URLs, extracting both textual and numeric features, and developing the MLP and pretrained NLP models. Results show that MultiText-LP achieves a high F1 score and accuracy on both the research dataset and a benchmark dataset. The study also notes limitations, including the GPU resources needed to run the models simultaneously and the limited availability of datasets.
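To make the split between textual and numeric features concrete, here is a minimal sketch using Python's standard `html.parser`. The visible text would feed an NLP model and the counts would feed an MLP. The particular features (links, forms, password inputs, scripts) and the names `HtmlFeatureExtractor` and `extract_features` are illustrative assumptions; the summary does not list the features the paper actually uses.

```python
from html.parser import HTMLParser


class HtmlFeatureExtractor(HTMLParser):
    """Collects visible text and a few simple tag counts from an HTML page."""

    SKIP_TAGS = {"script", "style"}  # their contents are not visible text

    def __init__(self):
        super().__init__()
        self.text_parts = []
        # Hypothetical numeric features, chosen only for illustration.
        self.counts = {"links": 0, "forms": 0, "password_inputs": 0, "scripts": 0}
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a":
            self.counts["links"] += 1
        elif tag == "form":
            self.counts["forms"] += 1
        elif tag == "input" and (attr_map.get("type") or "").lower() == "password":
            self.counts["password_inputs"] += 1
        elif tag == "script":
            self.counts["scripts"] += 1
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside <script>/<style>.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract_features(html: str):
    """Return (visible_text, numeric_counts) for one HTML document."""
    parser = HtmlFeatureExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.counts


page = ('<html><body><h1>Sign in</h1><form><input type="password"></form>'
        '<a href="#">Help</a><script>var x=1;</script></body></html>')
text, counts = extract_features(page)
print(text)
print(counts)
```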
In conclusion, the MultiText-LP model classifies HTML content for phishing detection and outperforms the individual NLP or MLP models used on their own. Future work aims to integrate URL, HTML content, and WHOIS data while improving computational efficiency.
Stats
96.80 F1 score achieved by MultiText-LP model.
97.18 accuracy score obtained by MultiText-LP model.
3.4 billion phishing emails sent daily.
$293,359 average wire transfer amount in BEC attacks.
95% of social engineering attacks are financially motivated.
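For readers unfamiliar with the two model metrics listed above, the sketch below shows how accuracy and F1 are computed from confusion-matrix counts, with phishing as the positive class. The counts in the example are made up for illustration and are not taken from the paper.

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all pages classified correctly."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall: 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical counts: 90 phishing pages caught, 10 missed,
# 95 benign pages passed, 5 benign pages flagged by mistake.
print(round(accuracy(tp=90, fp=5, tn=95, fn=10), 4))  # 0.925
print(round(f1_score(tp=90, fp=5, fn=10), 4))         # 0.9231
```

F1 ignores true negatives, so it is less flattering than accuracy when benign pages greatly outnumber phishing pages.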
Quotes
"The fusion of two NLP and one MLP model achieves impressive results."
"Our approach outperforms existing methods on CatchPhish HTML dataset."