Abstract
Breast cancer is one of the most serious diseases and a leading cause of cancer-related death among women worldwide. This study evaluates and compares several supervised machine learning algorithms for breast cancer tumor classification using a real-world dataset sourced from Kaggle.com. The dataset initially contained 212 observations; 205 remained after handling missing values. We applied logistic regression, a decision tree, a random forest, and support vector machines (SVMs) with various kernels, focusing on model accuracy, feature importance, and the impact of dimensionality reduction. All models performed well, with accuracies above 87%. The most effective classifiers were the random forest and the polynomial-kernel SVM, which achieved the highest area under the curve (AUC) values, 96.3% and 96.9%, respectively. Feature importance analysis consistently identified Tumor Size, Involved Lymph Nodes, Metastasis, and Age as the most significant predictors. We attribute the high accuracy of the simpler models, such as logistic regression and the linear SVM, to the linear separability of the dataset. Our findings also support the use of principal component analysis (PCA) for feature reduction, as key models maintained high performance on the reduced dataset.
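The comparison summarized above can be illustrated with a minimal sketch, assuming scikit-learn and a synthetic stand-in for the data. The Kaggle dataset is not reproduced here, so `make_classification` generates 205 placeholder observations; the number of features, the 70/30 split, the hyperparameters, and the choice of four principal components are all assumptions for illustration, not the settings used in the study. The helper names `build_models`, `ranking_scores`, and `evaluate` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_models(seed=0):
    """Classifier families named in the abstract; hyperparameters are assumed."""
    return {
        "logistic regression": LogisticRegression(max_iter=1000),
        "decision tree": DecisionTreeClassifier(random_state=seed),
        "random forest": RandomForestClassifier(n_estimators=200, random_state=seed),
        "SVM (linear)": SVC(kernel="linear"),
        "SVM (polynomial)": SVC(kernel="poly", degree=3),
        "SVM (RBF)": SVC(kernel="rbf"),
    }


def ranking_scores(model, X):
    """Continuous scores for AUC: margins if available, else class-1 probabilities."""
    if hasattr(model, "decision_function"):
        return model.decision_function(X)
    return model.predict_proba(X)[:, 1]


def evaluate(n_components=None, seed=0):
    """Fit every model on a held-out split; optionally reduce features with PCA first."""
    # Synthetic placeholder data (assumption): 205 rows to mirror the cleaned dataset size.
    X, y = make_classification(
        n_samples=205, n_features=10, n_informative=4, n_redundant=2, random_state=seed
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    results = {}
    for name, clf in build_models(seed).items():
        steps = [StandardScaler()]
        if n_components is not None:
            steps.append(PCA(n_components=n_components))
        steps.append(clf)
        # Scaling and PCA are fitted on the training split only, inside the pipeline.
        model = make_pipeline(*steps).fit(X_train, y_train)
        results[name] = {
            "accuracy": accuracy_score(y_test, model.predict(X_test)),
            "auc": roc_auc_score(y_test, ranking_scores(model, X_test)),
        }
    return results


if __name__ == "__main__":
    for label, k in [("all features", None), ("PCA, 4 components", 4)]:
        print(label)
        for name, m in evaluate(k).items():
            print(f"  {name:<20} accuracy={m['accuracy']:.3f}  AUC={m['auc']:.3f}")
```

Wrapping the scaler, the optional PCA step, and the classifier in one pipeline keeps the preprocessing fitted on training data only, so the with-PCA and without-PCA runs are compared on the same held-out observations. The numbers this sketch prints describe the synthetic data and are not comparable to the results reported in the study.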