Improving Gaussian Naive Bayes classification on imbalanced data through coordinate-based minority feature mining

通过基于坐标的少数特征挖掘改进不平衡数据上的高斯朴素贝叶斯分类

阅读:1

Abstract

As a widely used classification model, the Gaussian Naive Bayes (GNB) classifier experiences a significant decline in performance when handling imbalanced data. Most traditional approaches rely on sampling techniques; however, these methods alter the quantity and distribution of the original data and are prone to issues such as class overlap and overfitting, thus presenting clear limitations. This article proposes a coordinate transformation algorithm based on radial local relative density changes (RLDC). A key feature of this algorithm is that it preserves the original dataset's quantity and distribution. Instead of modifying the data, it enhances classification performance by generating new features that more prominently represent minority classes. The algorithm transforms the dataset from absolute coordinates to RLDC-relative coordinates, revealing latent local relative density change features. Due to the imbalanced distribution, sparse feature space, and class overlap, minority class samples can exhibit distinct patterns in these transformed features. Based on these new features, the GNB classifier can increase the conditional probability of the minority class, thereby improving its classification performance on imbalanced datasets. To validate the effectiveness of the proposed algorithm, this study conducts comprehensive comparative experiments using the GNB classifier on 20 imbalanced datasets of varying scales, dimensions, and characteristics. The evaluation includes 10 oversampling algorithms, two undersampling algorithms, and two hybrid sampling algorithms. Experimental results show that the RLDC-based coordinate transformation algorithm ranks first in the average performance across three classification evaluation metrics. Compared to the average values of the comparison algorithms, it achieves improvements of 21.84%, 33.45%, and 54.63% across the three metrics, respectively. This algorithm offers a novel approach to addressing the imbalanced data problem in GNB classification and holds significant theoretical and practical value.

特别声明

1、本页面内容包含部分的内容是基于公开信息的合理引用;引用内容仅为补充信息,不代表本站立场。

2、若认为本页面引用内容涉及侵权,请及时与本站联系,我们将第一时间处理。

3、其他媒体/个人如需使用本页面原创内容,需注明“来源:[生知库]”并获得授权;使用引用内容的,需自行联系原作者获得许可。

4、投稿及合作请联系:info@biocloudy.com。