Abstract
Breast cancer is a widespread disease involving abnormal (uncontrolled) growth of breast tissue cells along with the formation of a tumor and metastasis. Breast cancer cases occur mostly among women. Early detection and regular screening have significantly improved survival rates. This research classifies breast cancer and non-breast cancer cases using machine learning algorithms based on the Breast Cancer Coimbra dataset by optimizing the classifier performance and feature selection methodology. In addition, this research identifies the influential features responsible for BC classification by using diverse counterfactual explanations. The Rotation Forest classifier algorithm is used to classify breast cancer and non-breast cancer cases. The hyperparameters of this algorithm are optimized using the Optuna optimizer. Three wrapper-based feature selection techniques (Sequential Forward Selection, Sequential Backward Selection, and Exhaustive Feature Selection) are used to select the most relevant features. An ensemble environment is also created using the best feature subsets of these methods, incorporating both soft and hard voting strategies. Experimental results show that the hard voting strategy achieves an accuracy of 85.71%, F1-score of 83.87%, precision of 92.85%, and recall of 76.47%. In contrast, the soft voting strategy obtains an accuracy of 80.00%, F1-score of 77.42%, precision of 85.71%, and recall of 70.59%. These findings demonstrate that hard voting achieves noticeably better performance. The misclassification outcomes of both strategies are explored using Diverse Counterfactual Explanations, revealing that BMI and Glucose values are most influential in predicting correct classes, whereas the HOMA, Adiponectin, and Resistin values have little influence.