Key Concepts
A novel method for missing data imputation using granular semantics and an AI-driven pipeline to accurately predict bankruptcy in large, high-dimensional, and imbalanced financial datasets.
Summary
The paper presents a two-stage approach for bankruptcy prediction:
- Missing Data Imputation with Granular Semantics:
- The method uses granular computing to form a contextual granule around each missing value, drawing on the features most semantically relevant (highly correlated) to the missing entry and a small set of reliable observations.
- This allows efficient imputation within these small granules, without repeatedly scanning the entire large dataset for each missing value.
- The granular imputation method is shown to outperform benchmark imputation techniques, especially as the rate of missing data increases (see the imputation sketch after this list).
- AI-Driven Bankruptcy Prediction Pipeline:
- After filling in the missing values, the pipeline performs feature selection using Random Forest to reduce dimensionality.
- It then balances the highly imbalanced dataset using the Synthetic Minority Oversampling Technique (SMOTE).
- Finally, the pipeline evaluates six different classifiers, including Logistic Regression, Random Forest, and a Deep Neural Network, for bankruptcy prediction (a pipeline sketch follows this list).
- The proposed pipeline achieves accuracy around 90% and AUC around 0.8-0.9 across the five yearly subsets of the Polish bankruptcy dataset, demonstrating its effectiveness on large, high-dimensional, and imbalanced financial data.
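
The summary above describes granule formation only at a high level, so the following is a minimal Python sketch of the idea rather than the authors' algorithm: it assumes Pearson correlation for feature relevance, Euclidean distance for selecting the closest fully observed ("reliable") rows, and the granule mean as the imputed value. The function name `granular_impute` and the parameters `k_features` and `k_obs` are illustrative, not from the paper.

```python
import numpy as np
import pandas as pd

def granular_impute(df: pd.DataFrame, k_features: int = 5, k_obs: int = 10) -> pd.DataFrame:
    """Granule-style imputation sketch: for each missing cell, form a small
    granule from the features most correlated with that column plus the
    closest fully observed rows, and impute from the granule alone."""
    out = df.copy()
    corr = df.corr().abs()  # pairwise correlation computed on observed values
    for col in df.columns:
        missing_idx = df.index[df[col].isna()]
        if missing_idx.empty:
            continue
        # Context features: the columns most correlated with `col`.
        ctx = corr[col].drop(col).nlargest(k_features).index
        # Reliable observations: rows observed in `col` and all context features.
        reliable = df.dropna(subset=[col, *ctx])
        for i in missing_idx:
            row = df.loc[i, ctx].astype(float)
            if reliable.empty or row.isna().all():
                out.loc[i, col] = df[col].mean()  # fallback: no granule available
                continue
            obs = ctx[row.notna().to_numpy()]  # context features observed in this row
            # Distance from this row to each reliable row over the observed context features.
            d = np.linalg.norm(reliable[obs].to_numpy() - row[obs].to_numpy(), axis=1)
            granule = reliable.iloc[np.argsort(d)[:k_obs]]
            out.loc[i, col] = granule[col].mean()  # impute within the small granule
    return out
```

Because each granule involves only `k_features` columns and `k_obs` rows, the per-entry imputation cost stays small even on a large table, which is the efficiency argument made above.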
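The prediction stage maps naturally onto scikit-learn and imbalanced-learn. The sketch below is an assumed arrangement, not the authors' code: the hyperparameters are illustrative, and only three of the six classifiers are shown.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from imblearn.over_sampling import SMOTE

def run_pipeline(X, y, random_state=0):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=random_state)

    # 1. Feature selection via Random Forest importances to reduce dimensionality.
    selector = SelectFromModel(
        RandomForestClassifier(n_estimators=200, random_state=random_state))
    X_tr = selector.fit_transform(X_tr, y_tr)
    X_te = selector.transform(X_te)

    # 2. Balance the training set only with SMOTE (never the test set).
    X_tr, y_tr = SMOTE(random_state=random_state).fit_resample(X_tr, y_tr)

    # 3. Train and score candidate classifiers (three of the six shown here).
    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=200,
                                                random_state=random_state),
        "deep_neural_network": MLPClassifier(hidden_layer_sizes=(64, 32),
                                             max_iter=500,
                                             random_state=random_state),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        proba = model.predict_proba(X_te)[:, 1]
        print(f"{name}: acc={accuracy_score(y_te, model.predict(X_te)):.3f} "
              f"auc={roc_auc_score(y_te, proba):.3f}")
```

Applying SMOTE after the train/test split, and only to the training portion, keeps the reported accuracy and AUC honest with respect to the original class imbalance.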
Statistics
The Polish bankruptcy dataset contains 64 quantitative features with observations from 2007-2013; the number of missing values ranges from 4,666 to 12,157 across the five yearly subsets.
Quotes
"The merits of granular computing have been explored here to define this method. The missing values have been predicted using the feature semantics and reliable observations in a low-dimensional space, that is, in the granular space."
"The granules are formed around every missing entry, considering a few of the highly correlated features to that of the missing value. A small set of the most reliable closest observations is used in granule formation to preserve the relevance and reliability, that is, the context, of the database against the missing entries within those small granules."