Abstract
Renal cell carcinoma (RCC) is a leading cause of urinary system morbidity and mortality. Early identification is crucial for improving RCC patient outcomes. This study aims to construct and validate an RCC prediction model for at-risk individuals using machine learning (ML) based on routine clinical data. Data from the Quanzhou First Hospital Affiliated with Fujian Medical University between March 2014 and March 2024 were retrospectively collected, with 70% of cases randomly assigned to the training cohort and 30% to the validation cohort. Univariate analysis and hierarchical clustering were employed to identify discriminatory features and enable optimal ML algorithm selection. The performance of models based on seven ML algorithms was evaluated by sensitivity (recall), accuracy, F1-score, area under the receiver operating characteristic curve (AUC), calibration, and clinical net benefit. The algorithm achieving the best AUC was combined with recursive feature elimination to identify the features that maximize model performance and stability. The final RCC prediction model was then constructed, and the Shapley Additive Explanations method was used to visualize model characteristics and individual case predictions. Among the seven algorithms, eXtreme Gradient Boosting achieved the best performance and was selected for final model construction. Combined with recursive feature elimination, it identified 21 clinically relevant variables for RCC model construction: age, total protein, albumin, total bilirubin, alanine aminotransferase, alkaline phosphatase, gamma-glutamyl transpeptidase, glucose, lactate dehydrogenase, creatine kinase-MB, creatinine, potassium-chloride ratio, sodium ion, calcium ion, eosinophil count, hemoglobin, platelet count, Systemic Immune-Inflammation Index, Pan-Immune-Inflammation Value, platelet-lymphocyte ratio, and sodium-chloride ratio.
Subsequently, an RCC prediction model based on eXtreme Gradient Boosting was built using these 21 variables, achieving an AUC of 0.955 (95% CI: 0.938-0.976) and an average precision of 0.923 in the validation cohort. The calibration curve showed close agreement between predicted and observed risks. Finally, the Shapley Additive Explanations method illustrated the importance of each model feature and provided case-specific interpretations for clinicians. We developed and validated an ML model that uses routine clinical data for large-scale RCC screening. This cost-effective approach facilitates the early detection of and intervention for RCC, which may lead to improved clinical outcomes.
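Several of the 21 variables listed in the abstract are derived from routine blood counts and electrolytes rather than measured directly. The abstract does not state how they were computed, so the following is only a minimal sketch that assumes the commonly used definitions: Systemic Immune-Inflammation Index = platelets × neutrophils / lymphocytes, Pan-Immune-Inflammation Value = neutrophils × platelets × monocytes / lymphocytes, and platelet-lymphocyte ratio = platelets / lymphocytes. The class name `BloodCounts`, the function name `derived_features`, the dictionary keys, and the units are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class BloodCounts:
    """Routine laboratory values for one patient (units are assumptions)."""
    platelets: float    # 10^9/L
    neutrophils: float  # 10^9/L
    lymphocytes: float  # 10^9/L
    monocytes: float    # 10^9/L
    sodium: float       # mmol/L
    potassium: float    # mmol/L
    chloride: float     # mmol/L


def derived_features(c: BloodCounts) -> dict:
    """Compute derived indices using the commonly cited formulas (assumed here)."""
    # Both denominators must be positive for the ratios to be meaningful.
    if c.lymphocytes <= 0 or c.chloride <= 0:
        raise ValueError("lymphocytes and chloride must be positive")
    return {
        # Systemic Immune-Inflammation Index
        "SII": c.platelets * c.neutrophils / c.lymphocytes,
        # Pan-Immune-Inflammation Value
        "PIV": c.neutrophils * c.platelets * c.monocytes / c.lymphocytes,
        # Platelet-lymphocyte ratio
        "PLR": c.platelets / c.lymphocytes,
        # Electrolyte ratios named in the variable list
        "Na_Cl_ratio": c.sodium / c.chloride,
        "K_Cl_ratio": c.potassium / c.chloride,
    }


# Illustrative values only; these are not data from the study.
example = BloodCounts(
    platelets=250.0, neutrophils=4.0, lymphocytes=2.0, monocytes=0.5,
    sodium=140.0, potassium=4.0, chloride=100.0,
)
features = derived_features(example)
```

Computing these indices in one place, with an explicit guard on the denominators, keeps the derived variables consistent between the training and validation cohorts before they are passed to any model.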