Core Concepts
The area under the precision-recall curve (AUPRC) is not universally superior to the area under the receiver operating characteristic (AUROC) for evaluating binary classification models in the presence of class imbalance. The choice between AUROC and AUPRC should be guided by the specific context and intended use of the model, rather than solely by the class imbalance.
Summary
This paper examines the relationship between AUROC and AUPRC, and provides a nuanced perspective on when each metric should be used for evaluating binary classification models.
Key insights:
- AUROC and AUPRC differ in how they weigh false positives: AUROC weighs all false positives equally, while AUPRC weighs a false positive at threshold t by the inverse of the model's likelihood of outputting a score above t.
- AUROC favors model improvements uniformly over all positive samples, whereas AUPRC favors improvements for samples assigned higher scores over those assigned lower scores.
- AUPRC can unduly prioritize improvements to higher-prevalence subpopulations at the expense of lower-prevalence subpopulations, raising serious fairness concerns.
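The first insight can be checked numerically. The following is a minimal sketch, assuming the expectation forms AUROC(f) = 1 - E[FPR(f, t)] and AUPRC(f) = 1 - p_y(0) E[FPR(f, t) / P(f(x) > t)], with t drawn from the scores of positive samples. The scores and labels are made-up illustration data, and AUPRC is estimated as average precision. On a finite sample the code uses P(f(x) >= t), so that the positive sitting at the threshold counts as predicted positive; that convention is an assumption of this sketch.

```python
# Hypothetical toy data: ten samples with distinct scores (no ties).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

n = len(labels)
positives = [s for s, y in zip(scores, labels) if y == 1]
negatives = [s for s, y in zip(scores, labels) if y == 0]
p_neg = len(negatives) / n  # p_y(0): prevalence of the negative class


def fpr(t):
    """Fraction of negatives scored above threshold t."""
    return sum(s > t for s in negatives) / len(negatives)


def frac_at_or_above(t):
    """Fraction of all samples scored at or above t (finite-sample P(f(x) >= t))."""
    return sum(s >= t for s in scores) / n


def precision_at(t):
    """Precision when every sample scored at or above t is predicted positive."""
    predicted = [y for s, y in zip(scores, labels) if s >= t]
    return sum(predicted) / len(predicted)


# Expectation forms: average over thresholds drawn from positive-sample scores.
auroc_formula = 1 - sum(fpr(t) for t in positives) / len(positives)
auprc_formula = 1 - p_neg * sum(
    fpr(t) / frac_at_or_above(t) for t in positives
) / len(positives)

# Direct definitions: pairwise ranking accuracy and average precision.
auroc_direct = sum(p > q for p in positives for q in negatives) / (
    len(positives) * len(negatives)
)
auprc_direct = sum(precision_at(t) for t in positives) / len(positives)

print(auroc_formula, auroc_direct)
print(auprc_formula, auprc_direct)
```

The only difference between the two expectation forms is the divisor: a false positive above a threshold that few samples exceed is penalized more heavily by AUPRC, while AUROC counts it the same as any other.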
The authors demonstrate these findings theoretically, through synthetic experiments, and with real-world validation on fairness datasets. They also conduct an extensive literature review, finding that the claim of AUPRC being superior to AUROC under class imbalance is often made without proper justification or attribution.
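A toy ranking, which is not one of the authors' experiments, makes the second insight concrete. The sketch below assumes a hypothetical list of ten labels sorted by descending score, with one misordered negative/positive pair near the top and one near the bottom. It repairs each pair in turn and compares the gain in each metric. AUPRC is estimated as average precision, and all names are illustrative.

```python
def auroc(ranked_labels):
    """AUROC for labels sorted by descending score (assumes no tied scores)."""
    n_pos = sum(ranked_labels)
    n_neg = len(ranked_labels) - n_pos
    correct_pairs = 0
    negatives_seen = 0
    for y in ranked_labels:
        if y == 1:
            # Negatives not yet seen are ranked below this positive.
            correct_pairs += n_neg - negatives_seen
        else:
            negatives_seen += 1
    return correct_pairs / (n_pos * n_neg)


def average_precision(ranked_labels):
    """Mean of precision-at-rank taken over the ranks of the positives."""
    true_positives = 0
    precision_sum = 0.0
    for rank, y in enumerate(ranked_labels, start=1):
        if y == 1:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / true_positives


# Hypothetical rankings (1 = positive, 0 = negative), highest score first.
base = [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]        # mistakes at ranks 1-2 and 9-10
fix_top = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]     # swap the pair at ranks 1-2
fix_bottom = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]  # swap the pair at ranks 9-10

auroc_gain_top = auroc(fix_top) - auroc(base)
auroc_gain_bottom = auroc(fix_bottom) - auroc(base)
auprc_gain_top = average_precision(fix_top) - average_precision(base)
auprc_gain_bottom = average_precision(fix_bottom) - average_precision(base)

print("AUROC gain (top, bottom):", auroc_gain_top, auroc_gain_bottom)
print("AUPRC gain (top, bottom):", auprc_gain_top, auprc_gain_bottom)
```

In this example both repairs raise AUROC by the same amount, while the repair near the top of the ranking raises average precision by roughly 0.25 and the repair near the bottom by roughly 0.011.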
The paper concludes by providing practical guidance on when to use AUROC versus AUPRC, based on the specific context and intended use of the model, such as model comparison, screening for high-cost errors, or equitable resource distribution.
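Statistics
$\mathrm{AUROC}(f) = 1 - \mathbb{E}_{t \sim f(x) \mid y=1}\left[\mathrm{FPR}(f, t)\right]$
$\mathrm{AUPRC}(f) = 1 - p_y(0)\, \mathbb{E}_{t \sim f(x) \mid y=1}\left[\frac{\mathrm{FPR}(f, t)}{P(f(x) > t)}\right]$
$\lim_{p_0 \to 0} P\left(a_i = a_{i+1} = 1 \mid i = \arg\max_{j \in M} \mathrm{AUPRC}(f'_j)\right) = 1$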
Quotes
"AUPRC can unduly prioritize improvements to higher-prevalence subpopulations at the expense of lower-prevalence subpopulations, raising serious fairness concerns in any multi-population use cases."
"AUROC favors model improvements uniformly over all positive samples, whereas AUPRC favors improvements for samples assigned higher scores over those assigned lower scores."