The paper opens by highlighting the ubiquity of classification evaluation in machine learning, noting that the choice of an appropriate evaluation metric is often left nebulous and unsupported by clear arguments. The author then introduces five key properties for analyzing classification evaluation metrics: monotonicity, class sensitivity, class decomposability, prevalence invariance, and chance correction.
The paper then conducts a detailed analysis of common evaluation metrics, including accuracy, macro recall, macro precision, macro F1, weighted F1, Kappa, and Matthews Correlation Coefficient (MCC). The analysis reveals that these metrics differ in their properties and the implicit assumptions they make about the desired characteristics of a classifier.
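For concreteness, the sketch below computes each of the metrics named above on a small invented set of labels using scikit-learn; the toy data and the use of scikit-learn are illustrative assumptions and are not taken from the paper.

```python
# Minimal illustration (not from the paper): the discussed metrics on toy,
# imbalanced multi-class data, computed with scikit-learn.
from sklearn.metrics import (
    accuracy_score, recall_score, precision_score, f1_score,
    cohen_kappa_score, matthews_corrcoef,
)

y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]   # gold labels (class 2 is most prevalent)
y_pred = [0, 0, 1, 1, 2, 2, 2, 2, 0, 2]   # hypothetical system predictions

print("accuracy       ", accuracy_score(y_true, y_pred))
print("macro recall   ", recall_score(y_true, y_pred, average="macro"))
print("macro precision", precision_score(y_true, y_pred, average="macro"))
print("macro F1       ", f1_score(y_true, y_pred, average="macro"))
print("weighted F1    ", f1_score(y_true, y_pred, average="weighted"))
print("Kappa          ", cohen_kappa_score(y_true, y_pred))
print("MCC            ", matthews_corrcoef(y_true, y_pred))
```

Running this on skewed data such as the above makes the differences in implicit assumptions visible: accuracy and weighted F1 track the majority class, while macro-averaged scores, Kappa, and MCC penalize errors on rare classes more heavily.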
The author also introduces metric variants, such as geometric and harmonic mean versions of macro recall, and discusses the concept of prevalence calibration to ensure prevalence invariance. Additionally, the paper examines the practice of metric selection in recent shared tasks, finding that clear justifications for metric choices are often lacking.
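As a rough sketch of these ideas (assuming a confusion matrix with gold classes as rows and predicted classes as columns), macro recall's arithmetic mean over per-class recalls can be replaced by a geometric or harmonic mean; the prevalence-calibration step shown here, row-normalizing the confusion matrix so every class carries equal weight, is one plausible reading of the concept rather than the paper's exact formulation.

```python
import numpy as np

def per_class_recall(cm):
    """Per-class recall from a confusion matrix (rows = gold, columns = predicted)."""
    return np.diag(cm) / cm.sum(axis=1)

def macro_recall(cm):
    """Standard macro recall: arithmetic mean of per-class recalls."""
    return float(per_class_recall(cm).mean())

def geom_macro_recall(cm):
    """Geometric-mean variant: zero recall on any single class drives the score to zero."""
    r = per_class_recall(cm)
    return float(np.prod(r) ** (1.0 / len(r)))

def harm_macro_recall(cm):
    """Harmonic-mean variant: dominated by the worst-recalled class."""
    r = per_class_recall(cm)
    return float(len(r) / np.sum(1.0 / r))

def prevalence_calibrate(cm):
    """Illustrative prevalence calibration (an assumption, not the paper's exact recipe):
    rescale each gold-class row to sum to 1 so that every class contributes equally,
    regardless of how often it occurs in the test data."""
    return cm / cm.sum(axis=1, keepdims=True)
```

Under this reading, a metric computed on the calibrated matrix depends only on the classifier's per-class behavior, not on how the test set's classes happen to be distributed, which is what the prevalence-invariance property asks for.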
The paper concludes with a set of recommendations for researchers, emphasizing the importance of clearly stating the evaluation metric, building a case for the chosen metric, considering the presentation of multiple complementary metrics, and potentially admitting multiple "winning" systems when a single best metric cannot be determined.