Evaluating the three-level approach of the U-smile method for imbalanced binary classification

Abstract

Real-life binary classification problems often involve imbalanced datasets, in which the majority class outnumbers the minority class. We previously developed the U-smile method, which comprises the U-smile plot and the BA, RB and I coefficients, to assess the usefulness of a new variable added to a reference prediction model, and validated it under class balance. In this study, we evaluated the U-smile method under class imbalance, proposed a three-level approach of the U-smile method, and used the I coefficients as a weighting factor for point size in the U-smile plots of the BA and RB coefficients. Using real data from the Heart Disease dataset and generated random variables, we built logistic regression models to assess four new variables, each added to the reference model (nested setting). The models were evaluated at seven pre-defined imbalance levels, with the event class making up 1%, 10%, 30%, 50%, 70%, 90% and 99% of the data. The results of the U-smile method were compared with those of traditional measures: the Brier skill score, the net reclassification index, the difference in F1-score, the difference in Matthews correlation coefficient, the difference in the area under the receiver operating characteristic curve between the new and reference models, and the likelihood-ratio test. The reference model overfitted to the majority class at higher imbalance levels. The BA-RB-I coefficients of the U-smile method identified informative variables across the entire imbalance range. At higher imbalance levels, the U-smile method indicated both improved prediction in the minority class (positive BA and I coefficients) and reduced overfitting to the majority class (negative RB coefficients). The U-smile method outperformed the traditional evaluation measures across most of the imbalance range. It proved highly effective for variable selection in imbalanced binary classification, making it a useful tool for real-life problems, where imbalanced datasets are prevalent.
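To make the evaluation setup more concrete, the following is a minimal Python sketch of two ingredients mentioned in the abstract: drawing a subsample at a pre-defined event share, and computing a Brier skill score for a new model relative to a reference model. The abstract does not say how the imbalance levels were produced or which form of the Brier skill score was used, so the function names, the random subsampling scheme, and the formula `1 - BS_new / BS_ref` are assumptions made for illustration, not the authors' implementation.

```python
import random

# The seven event-class shares listed in the abstract.
IMBALANCE_LEVELS = [0.01, 0.10, 0.30, 0.50, 0.70, 0.90, 0.99]


def subsample_to_event_rate(y, event_rate, n_total, seed=0):
    """Return sorted row indices whose labels have the requested event share.

    Assumption: imbalance is created by random subsampling without
    replacement; the abstract does not describe the actual procedure.
    """
    rng = random.Random(seed)
    events = [i for i, label in enumerate(y) if label == 1]
    non_events = [i for i, label in enumerate(y) if label == 0]
    n_events = round(event_rate * n_total)
    n_non_events = n_total - n_events
    if n_events > len(events) or n_non_events > len(non_events):
        raise ValueError("not enough rows in one class for this level")
    chosen = rng.sample(events, n_events) + rng.sample(non_events, n_non_events)
    return sorted(chosen)


def brier_score(y, p):
    """Mean squared difference between predicted probability and outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


def brier_skill_score(y, p_ref, p_new):
    """Relative Brier improvement of the new model over the reference model.

    Assumption: the reference model serves as the baseline, so a positive
    value means the new model has a lower Brier score than the reference.
    """
    return 1.0 - brier_score(y, p_new) / brier_score(y, p_ref)
```

In use, one would loop over `IMBALANCE_LEVELS`, call `subsample_to_event_rate` for each level, fit the reference and new models on the selected rows, and pass their predicted probabilities to `brier_skill_score`. The BA, RB and I coefficients themselves are not defined in the abstract and are therefore not sketched here.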
