Optimized KNN with domain-informed features and LIME explainability for improved breast cancer classification



Abstract

BACKGROUND: Breast cancer remains one of the leading causes of cancer-related mortality among women worldwide, with more than 2.3 million new cases and approximately 670,000 deaths reported globally in 2022. Early and accurate diagnosis significantly improves survival rates; however, conventional diagnostic approaches are often time-consuming and subject to inter-observer variability. Although machine learning techniques have demonstrated promising results, many existing studies lack systematic hyperparameter optimization and robust strategies to improve model generalization. This study aimed to develop an optimized and interpretable K-Nearest Neighbour (KNN) framework for breast cancer classification.

METHODS: The Breast Cancer Wisconsin (Diagnostic) Dataset (WDBC), comprising 569 samples with 32 features, was used for model development and evaluation. The proposed framework incorporated advanced preprocessing, biologically informed feature engineering, hybrid feature selection, and systematic hyperparameter tuning using GridSearchCV. An ensemble KNN model employing soft voting was introduced to enhance predictive stability and performance. Model interpretability was improved using the Local Interpretable Model-Agnostic Explanations (LIME) technique to identify feature contributions for malignant and benign classifications.

RESULTS: The optimized KNN model achieved an accuracy of 98.25%, while the ensemble KNN model reached 99.12% accuracy. The proposed framework demonstrated high predictive performance, improved classification stability, and enhanced interpretability through feature-level explanation analysis.

CONCLUSIONS: The findings demonstrate the methodological effectiveness of an optimized and ensemble-based KNN framework for breast cancer classification. While the results indicate strong benchmark performance on the WDBC dataset, the study primarily highlights methodological robustness rather than immediate clinical generalizability. Further validation on multi-center clinical datasets is required before practical deployment in decision-support systems.
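
The abstract does not say how the soft-voting KNN ensemble is built, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that the ensemble members are KNN models differing only in k, that each member's class probability is the fraction of its k nearest neighbours in that class, and that soft voting averages those probabilities before taking the most probable class. The function names `knn_proba` and `soft_vote`, the k values, and the toy data are all hypothetical.

```python
import math
from collections import Counter

def knn_proba(train_X, train_y, x, k, classes):
    """Class probabilities for x: fraction of its k nearest neighbours in each class."""
    # Sort training points by Euclidean distance to x and keep the k closest.
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))[:k]
    counts = Counter(label for _, label in nearest)
    return [counts[c] / k for c in classes]

def soft_vote(train_X, train_y, x, ks, classes):
    """Average the class probabilities of several KNN models (one per k), then take the argmax."""
    probas = [knn_proba(train_X, train_y, x, k, classes) for k in ks]
    mean = [sum(col) / len(probas) for col in zip(*probas)]
    best = max(range(len(classes)), key=lambda i: mean[i])
    return classes[best], mean

# Toy two-feature data, invented for illustration (these are not WDBC values).
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
train_y = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]
classes = ["benign", "malignant"]

# Assumed ensemble: three KNN members with k = 1, 3 and 5.
label, mean_proba = soft_vote(train_X, train_y, (2.9, 3.1), ks=[1, 3, 5], classes=classes)
print(label, mean_proba)
```

Averaging probabilities rather than counting hard votes lets a member that is uncertain (here k = 5, which reaches into the other cluster) contribute less than members that are unanimous. In a scikit-learn workflow the same idea would typically be expressed with `VotingClassifier(voting="soft")` over several `KNeighborsClassifier` instances, with k and the distance weighting chosen by `GridSearchCV`.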
