Abstract
Machine learning methods, especially K-Means clustering, have demonstrated potential in analyzing medical data by facilitating pattern detection. However, the classic K-Means algorithm suffers from two major limitations: (1) its reliance on a single, often suboptimal distance metric (typically Euclidean), and (2) the lack of a mechanism to refine clusters after assignment, which can lead to poor cohesion and misgrouping. To address these challenges, this paper proposes an enhanced K-Means clustering framework with two key innovations: (i) a hybrid distance that combines the cosine and cityblock (Manhattan) metrics through a tunable weight to better capture the structure of medical data, and (ii) an efficient cluster refinement mechanism based on Z-score outlier detection that reassigns distant samples to improve cluster quality. First, we evaluate K-Means with five distance metrics (Euclidean, cosine, cityblock, Chebyshev, and Minkowski) on two public medical datasets: Breast Cancer Wisconsin (BCW) and Heart Disease. Then, we introduce the hybrid distance strategy, systematically varying the weight between cosine and cityblock to identify the optimal combination. After the initial clustering, the refinement step uses Z-scores to identify data points that lie far from their cluster centroids and reassigns them to more suitable clusters, improving cluster homogeneity and separation. The proposed method is evaluated using multiple metrics: accuracy, precision, recall, F1-score, Adjusted Rand Index (ARI), homogeneity, and execution time. Results show substantial improvements over traditional approaches and over advanced clustering methods (deep clustering and spectral clustering). On the BCW and Heart Disease datasets, the proposed method achieves accuracies of 0.9825 and 0.9000, respectively, outperforming Euclidean K-Means (0.8752, 0.8316) and cosine-based K-Means (0.9350, 0.8418).
Homogeneity scores also improve substantially, from 0.7721 to 0.8676 on the BCW dataset and from 0.4335 to 0.5352 on the Heart Disease dataset, demonstrating the effectiveness of the refinement step. This work presents an original, practical enhancement to K-Means clustering for healthcare applications, offering improved accuracy, interpretability, and robustness through a hybrid distance strategy and a novel refinement mechanism. The results provide deeper insights into unsupervised learning for medical data analysis and support its potential in real-world clinical decision-making.
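
To make the two components concrete, the following Python sketch shows one way a weighted cosine/cityblock distance and a Z-score-based reassignment step might be implemented. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the function names, the weight `alpha`, the threshold `z_thresh`, the rescaling of the cityblock term, and the choice to recompute centroids without flagged points before reassigning them are all hypothetical. The sketch also assumes that no sample or centroid is an all-zero vector, since cosine distance is undefined in that case.

```python
import numpy as np
from scipy.spatial.distance import cdist


def hybrid_distance(X, centroids, alpha=0.5):
    """Weighted mix of cosine and cityblock distances to each centroid.

    alpha=1 gives pure cosine; alpha=0 gives pure (rescaled) cityblock.
    """
    d_cos = cdist(X, centroids, metric="cosine")
    d_city = cdist(X, centroids, metric="cityblock")
    # Assumption: rescale cityblock to [0, 1] so the two terms are comparable.
    d_city = d_city / (d_city.max() + 1e-12)
    return alpha * d_cos + (1.0 - alpha) * d_city


def hybrid_kmeans(X, k, alpha=0.5, n_iter=100, seed=0):
    """Lloyd-style iterations using the hybrid distance for assignment."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = hybrid_distance(X, centroids, alpha).argmin(axis=1)
        # Mean update; an empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    labels = hybrid_distance(X, centroids, alpha).argmin(axis=1)
    return labels, centroids


def zscore_refine(X, labels, centroids, alpha=0.5, z_thresh=2.0):
    """Flag points far from their own centroid and reassign them.

    Returns the refined labels and a boolean mask of the flagged points.
    """
    k = len(centroids)
    D = hybrid_distance(X, centroids, alpha)
    own = D[np.arange(len(X)), labels]  # distance to the assigned centroid
    flagged = np.zeros(len(X), dtype=bool)
    for j in range(k):
        members = np.where(labels == j)[0]
        if len(members) < 2:
            continue
        d = own[members]
        std = d.std()
        if std == 0:
            continue
        z = (d - d.mean()) / std
        flagged[members[z > z_thresh]] = True
    # Assumption: recompute centroids without flagged points, so that
    # reassignment is not a no-op against the original centroids.
    trimmed = np.array([
        X[(labels == j) & ~flagged].mean(axis=0)
        if np.any((labels == j) & ~flagged) else centroids[j]
        for j in range(k)
    ])
    refined = labels.copy()
    if flagged.any():
        refined[flagged] = hybrid_distance(X, trimmed, alpha)[flagged].argmin(axis=1)
    return refined, flagged
```

In this sketch, a call such as `labels, centroids = hybrid_kmeans(X, k=2, alpha=0.6)` followed by `refined, flagged = zscore_refine(X, labels, centroids, alpha=0.6)` yields the refined assignment; sweeping `alpha` over a grid in [0, 1] corresponds to the weight search described above.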