The core message of this article is that the Rashomon Importance Distribution (RID) framework provides a stable and comprehensive way to quantify variable importance, overcoming the limitations of existing methods.
The key insights are:
Current variable importance methods are limited because they either focus on a single model (ignoring the Rashomon effect) or compute a range of variable importance values across the Rashomon set (ignoring instability across data perturbations).
RID constructs the cumulative distribution function (CDF) of variable importance across the Rashomon set, weighted by the proportion of datasets for which each model is in the Rashomon set. This accounts for both the Rashomon effect and stability.
Theoretical results show that the estimated RID converges to the true variable importance distribution for the data-generating process as the number of bootstrap samples increases, provided the models in the Rashomon set approximate that process well.
Experiments on synthetic datasets demonstrate that RID accurately distinguishes important variables from extraneous ones and captures the true variable importance better than existing methods.
A case study on predicting HIV load from single-cell transcriptomic and chromatin data identifies an unexplored gene, LINC00486, as the most important variable, suggesting a novel direction for future HIV research.
Overall, RID provides a flexible, stable, and comprehensive framework for variable importance analysis that can be integrated with most existing variable importance metrics.
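The construction described above (bootstrap the data, find the Rashomon set on each resample, then average the within-set CDF of a variable's importance) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the model class (least-squares fits on every feature subset), the importance metric (permutation importance), the multiplicative tolerance `eps`, and all function names are hypothetical choices made for this example.

```python
import itertools
import numpy as np

def fit_subset(X, y, cols):
    """Least-squares fit on the chosen columns plus an intercept."""
    A = np.column_stack([X[:, cols], np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(X, cols, coef):
    A = np.column_stack([X[:, cols], np.ones(len(X))])
    return A @ coef

def mse(y, yhat):
    return float(np.mean((y - yhat) ** 2))

def permutation_importance(X, y, cols, coef, j, rng):
    """Increase in MSE when column j is shuffled (0 if the model ignores j)."""
    if j not in cols:
        return 0.0
    base = mse(y, predict(X, cols, coef))
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    return mse(y, predict(Xp, cols, coef)) - base

def rid_cdf(X, y, j, thresholds, n_boot=50, eps=0.05, seed=0):
    """Estimate P(importance of variable j <= t) for each t in thresholds.

    For every bootstrap resample, the Rashomon set is taken to be all
    candidate models whose loss is within a factor (1 + eps) of the best
    loss on that resample (an assumption of this sketch). The CDF is the
    bootstrap average of the fraction of Rashomon-set models whose
    importance for variable j is at most t.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    subsets = [list(c) for r in range(1, p + 1)
               for c in itertools.combinations(range(p), r)]
    cdf = np.zeros(len(thresholds))
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # bootstrap resample
        Xb, yb = X[idx], y[idx]
        fits = [(cols, fit_subset(Xb, yb, cols)) for cols in subsets]
        losses = np.array([mse(yb, predict(Xb, cols, coef))
                           for cols, coef in fits])
        in_set = losses <= (1 + eps) * losses.min()
        imps = np.array([
            permutation_importance(Xb, yb, cols, coef, j, rng)
            for (cols, coef), keep in zip(fits, in_set) if keep
        ])
        cdf += np.array([(imps <= t).mean() for t in thresholds])
    return cdf / n_boot

# Toy data: only feature 0 drives the outcome; features 1 and 2 are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)
thresholds = [0.0, 0.5, 5.0, 20.0]
for j in range(3):
    print(j, rid_cdf(X, y, j, thresholds, n_boot=20))
```

On data like this, the CDF for an important variable should put its mass at large importance values, while the CDF for an extraneous variable should rise to 1 near zero. Because each resample contributes its own Rashomon set, a model that is only occasionally "good" influences the result in proportion to how often it qualifies.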
Key insights distilled from
by Jon Donnelly... at arxiv.org 04-03-2024
https://arxiv.org/pdf/2309.13775.pdf