Core Concepts
A comprehensive customer-level dataset for advancing machine learning research and developing effective fraud detection systems.
Summary
The study introduces the Customer-level Fraud Detection Benchmark (CFDB), a structured dataset designed to overcome the limitations of traditional transaction-level fraud detection datasets. The CFDB aggregates customer-level data from three datasets (SAML-D, AML-World-HI-Small, and AML-World-LI-Small) to provide a more comprehensive view of customer behavior patterns and to facilitate the development of sophisticated machine learning models for fraud detection.
The key highlights of the CFDB include:
- Transformation of transaction-level data into customer profiles, capturing detailed behavioral patterns and transactional histories at the individual customer level.
- Provision of a standardized benchmark for evaluating the performance of various machine learning models in detecting fraudulent activities, using metrics such as precision, recall, accuracy, AUC, and F1 score.
- Facilitation of collaborative research efforts and cross-institutional knowledge sharing to drive innovation in fraud detection technologies.
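The first highlight above, turning transaction-level rows into customer profiles, can be sketched as a group-and-aggregate step. This is only a minimal illustration using the Python standard library. The field names (`customer_id`, `amount`, `counterparty`, `is_suspicious`), the chosen features, and the rule that a customer is flagged when any of their transactions is flagged are all assumptions made for the example, not the CFDB's actual schema or labeling procedure.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical transaction records; field names and values are illustrative only.
transactions = [
    {"customer_id": "C1", "amount": 120.0, "counterparty": "C2", "is_suspicious": 0},
    {"customer_id": "C1", "amount": 80.0, "counterparty": "C3", "is_suspicious": 0},
    {"customer_id": "C2", "amount": 9500.0, "counterparty": "C4", "is_suspicious": 1},
    {"customer_id": "C2", "amount": 9400.0, "counterparty": "C5", "is_suspicious": 0},
    {"customer_id": "C3", "amount": 15.0, "counterparty": "C1", "is_suspicious": 0},
]


def build_customer_profiles(rows):
    """Group transactions by customer and compute simple per-customer features."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["customer_id"]].append(row)

    profiles = {}
    for customer_id, txns in grouped.items():
        amounts = [t["amount"] for t in txns]
        profiles[customer_id] = {
            "n_transactions": len(txns),
            "total_amount": sum(amounts),
            "mean_amount": mean(amounts),
            "max_amount": max(amounts),
            "n_counterparties": len({t["counterparty"] for t in txns}),
            # Assumed labeling rule: a customer is flagged if any of
            # their transactions is flagged.
            "is_suspicious": int(any(t["is_suspicious"] for t in txns)),
        }
    return profiles


profiles = build_customer_profiles(transactions)
for customer_id, profile in sorted(profiles.items()):
    print(customer_id, profile)
```

Each resulting profile is one row per customer, which is the shape a customer-level classifier would consume. A real pipeline would likely add time-windowed and network features as well.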
The study evaluates the performance of several baseline machine learning models, including Linear Regression, Decision Tree, XGBoost, and Neural Network, on the CFDB. The results reveal the strengths and limitations of each model and highlight the importance of assessing performance with several complementary metrics rather than a single score, especially on the imbalanced datasets typical of fraud detection tasks.
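To illustrate why a single metric can mislead on imbalanced data, the sketch below scores a trivial "never flag anyone" baseline on a made-up population of 1,000 customers, 10 of whom are fraudulent. The data, the function names, and the convention of reporting precision as 0 when nothing is predicted positive are assumptions for this example. They are not results or code from the study.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Assumed convention: an undefined ratio (zero denominator) is reported as 0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up population: 10 fraudulent customers out of 1,000.
y_true = [1] * 10 + [0] * 990

# Trivial baseline: never flag anyone, with a constant score for everyone.
y_pred = [0] * 1000
scores = [0.0] * 1000

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics)
print("auc:", auc)
```

The baseline reaches 99% accuracy while catching no fraud at all. Its recall and F1 are 0 and its AUC is 0.5, which is the kind of gap that makes it necessary to report several metrics together.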
The findings emphasize the need for ongoing research and development to refine machine learning techniques, address data imbalances, and explore hybrid approaches that leverage the complementary strengths of various models. By contributing the CFDB as a valuable resource, the study aims to set a new standard in fraud detection research and empower the development of next-generation fraud detection systems.
Stats
The SAML-D dataset contains 9,504,852 transactions, with 0.1039% of transactions flagged as suspicious.
The AML-World-HI-Small dataset contains 6,924,049 transactions, with 0.751% of customers flagged as suspicious.
The AML-World-LI-Small dataset contains 5,078,345 transactions, with 1.23% of customers flagged as suspicious.
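As a rough illustration of the imbalance these figures imply, the SAML-D percentage can be converted into an approximate count and a negative-to-positive ratio. The result is only approximate because the published percentage is rounded. The same calculation is not possible for the two AML-World datasets from the figures above, since their rates are given per customer and the customer counts are not stated here.

```python
# Figures for SAML-D as stated above.
n_transactions = 9_504_852
suspicious_pct = 0.1039  # percent of transactions flagged as suspicious

# Approximate number of suspicious transactions implied by the rounded percentage.
approx_suspicious = round(n_transactions * suspicious_pct / 100)

# Approximate number of normal transactions per suspicious one.
normal_per_suspicious = (100 - suspicious_pct) / suspicious_pct

print("approx. suspicious transactions:", approx_suspicious)
print("approx. normal per suspicious:", round(normal_per_suspicious))
```

Under these figures, roughly one transaction in a thousand is suspicious.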
Quotes
"The availability of comprehensive and privacy-compliant datasets is crucial for advancing machine learning research and developing effective anti-fraud systems."
"By offering the dataset to the research community, we aim to set a new standard in fraud detection research, providing a tool that can significantly enhance the predictive accuracy of fraud detection systems."
"This initiative not only fosters innovation in machine learning model development but also contributes to safer banking practices, ultimately protecting consumers and financial institutions alike from the perils of fraudulent activities."