Breast Cancer Data Analysis Using Supervised Machine Learning Algorithms

Abstract

Breast cancer is one of the most serious diseases and a leading cause of cancer-related deaths for women worldwide. This study evaluates and compares the performance of several supervised machine learning algorithms for breast cancer tumor classification, using a real-world dataset (sourced from Kaggle.com). From an initial 212 observations, the final dataset was reduced to 205 after handling missing values. We employed logistic regression, decision tree, random forest, and support vector machines (SVMs) with various kernels, focusing on model accuracy, feature importance, and the impact of dimensionality reduction. All models demonstrated strong performance, with accuracies above 87%. The most effective classifiers were the random forest and polynomial SVM, achieving the highest area under the curve (AUC) values of 96.3% and 96.9%, respectively. Feature importance analysis consistently identified Tumor Size, Involved Lymph Nodes, Metastasis, and Age as the most significant predictors. The high accuracy of simpler models, such as logistic regression and a linear SVM, is attributed to the dataset's inherent linear separability. Our findings also validate the use of principal component analysis (PCA) for feature reduction, as key models maintained high performance on the simplified dataset.
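The abstract ranks classifiers by area under the ROC curve (AUC). As a reminder of what that number measures, the following is a minimal sketch and not the study's code: AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The function name `roc_auc`, the label convention (1 = malignant, 0 = benign), and the toy scores are assumptions made for illustration; none of them come from the study or its dataset.

```python
def roc_auc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (assumed convention: 1 = malignant, 0 = benign).
# These scores are illustrative only and are not taken from the study.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)  # 8 of 9 pairs are ranked correctly
```

The quadratic pairwise loop is chosen for readability. On a dataset of roughly 200 observations it is fast enough, and it should agree with rank-based implementations found in standard libraries.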