Core Concepts
A multilevel stochastic optimization approach based on computational applied mathematics techniques can accurately and efficiently impute missing values in massive medical datasets, significantly outperforming current state-of-the-art methods.
Abstract
The paper introduces a novel multilevel stochastic optimization approach based on computational applied mathematics techniques to address the challenge of missing data imputation in massive medical datasets.
Key highlights:
- Missing data is a critical problem in many large healthcare datasets, such as the National Inpatient Sample (NIS) and State Inpatient Databases (SID), and can lead to biased estimates and misinformed policy decisions.
- Current imputation methods recommended by the HCUP report, such as Predictive Mean Matching (PMM) and Predicted Posterior Distribution (PPD), are often suboptimal, especially for noisy signals.
- The proposed multilevel Kriging/Best Linear Unbiased Predictor (BLUP) method is highly accurate and numerically stable, addressing the challenges of ill-conditioned covariance matrices that plague traditional Kriging approaches.
- The multilevel formulation is exact and significantly faster, and it reduces imputation error by up to 75% compared to current methods, making it suitable for practical application to massive datasets.
- Benchmark tests on the NIS dataset show the multilevel Kriging/BLUP method outperforms state-of-the-art techniques, including discriminative deep learning, when imputing the total charge variable at high missing data rates.
- The multilevel approach can be extended to handle categorical variables and quantify imputation uncertainty through techniques like Karhunen-Loève expansion.
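To make the Kriging/BLUP idea in the highlights above concrete, here is a minimal sketch of plain single-level Kriging used as an imputer. It is not the paper's multilevel algorithm. The squared-exponential covariance, the constant-mean model, the nugget term, the synthetic data, and the names `sq_exp_cov` and `blup_impute` are all assumptions made for illustration. The sketch also prints the condition number of the covariance matrix with and without the nugget, which hints at the ill-conditioning problem mentioned above.

```python
import numpy as np

def sq_exp_cov(A, B, length_scale=1.0):
    """Squared-exponential covariance between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def blup_impute(X_obs, y_obs, X_mis, length_scale=1.0, nugget=1e-6):
    """Single-level Kriging/BLUP with an unknown constant mean (illustrative only)."""
    n = len(y_obs)
    # The nugget on the diagonal models observation noise and stabilizes the solve.
    C = sq_exp_cov(X_obs, X_obs, length_scale) + nugget * np.eye(n)
    c = sq_exp_cov(X_obs, X_mis, length_scale)  # shape (n_obs, n_mis)
    ones = np.ones(n)
    # Generalized least-squares estimate of the constant mean.
    beta = (ones @ np.linalg.solve(C, y_obs)) / (ones @ np.linalg.solve(C, ones))
    # BLUP: beta + c^T C^{-1} (y - beta * 1)
    return beta + c.T @ np.linalg.solve(C, y_obs - beta * ones)

# Synthetic stand-in data (NOT the NIS or SID data): one covariate, noisy signal.
rng = np.random.default_rng(0)
noise_sd = 0.05
X = rng.uniform(0.0, 6.0, size=(200, 1))
y = np.sin(X[:, 0]) + noise_sd * rng.standard_normal(200)
missing = rng.random(200) < 0.2  # hide roughly 20% of the responses

X_obs, y_obs = X[~missing], y[~missing]
X_mis, y_true = X[missing], y[missing]

y_hat = blup_impute(X_obs, y_obs, X_mis, nugget=noise_sd**2)

# Compare against a naive baseline that imputes the observed mean.
rmse_blup = np.sqrt(np.mean((y_hat - y_true) ** 2))
rmse_mean = np.sqrt(np.mean((y_obs.mean() - y_true) ** 2))

# Conditioning of the covariance matrix without and with the nugget.
C_raw = sq_exp_cov(X_obs, X_obs)
cond_raw = np.linalg.cond(C_raw)
cond_reg = np.linalg.cond(C_raw + noise_sd**2 * np.eye(len(y_obs)))

print(f"RMSE (BLUP): {rmse_blup:.4f}   RMSE (mean imputation): {rmse_mean:.4f}")
print(f"cond(C) raw: {cond_raw:.2e}   with nugget: {cond_reg:.2e}")
```

The nugget is only a common workaround that trades exactness for stability, and each dense solve scales cubically with the number of observed records. Per the highlights above, the multilevel formulation is what addresses ill-conditioning and cost while remaining exact; that construction is not reproduced here.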
Stats
The total charge variable in the NIS 2013 dataset has a 2% missing data rate.
The Michigan SID dataset has a 19.79% missing data rate for the total charge variable.
Quotes
"Missing data form an important problem in medical record datasets. In particular, the HCUP Report #2015-01 by [21], stresses the need to address missing data in the National Inpatient Sample (NIS) and State Inpatient Databases (SID)."
"Current imputation algorithms recommended by the HCUP report #2015-01 include Predicted Mean Matching (PMM), Predicted Posterior Distribution (PPD) and linear regression ( [22], [23]). These algorithms often are sub-optimal, in particular for noisy signals."