Abstract
OBJECTIVE: Almost all hormone-sensitive prostate cancer (HSPC) cases eventually progress to castration-resistant prostate cancer (CRPC) following androgen deprivation therapy (ADT). This study aimed to develop a machine learning (ML) model to predict progression in HSPC patients and, through statistical analysis of the dataset, to identify the features and clinical markers that predict the transition from HSPC to CRPC.
METHODS: Data from 410 HSPC patients treated at Yunnan Cancer Hospital between 1 January 2017 and 31 May 2022 were analyzed. Predictive analyses used features recorded at each patient's initial visit. The ML methods employed were decision tree (DT), random forest (RF), XGBoost, artificial neural network (ANN), and support vector machine (SVM). Feature selection was conducted using a genetic algorithm (GA). The models were trained on 80% of the data and validated on the remaining 20%, which was held out as a test set. Discrimination was assessed using the area under the receiver operating characteristic (ROC) curve (AUC), and calibration plots and learning curves were used to assess calibration and fit. Evaluation metrics included accuracy (ACC), precision (PRE), specificity (SPE), sensitivity (SEN), and F1 score.
RESULTS: Evaluation metrics were visualized with confusion matrices and ROC curves. Ensemble learning methods, particularly RF and XGBoost, performed best. RF achieved a score of 0.838 (95% CI: 0.8324-0.902) on the training dataset and 0.817 (95% CI: 0.659-0.829) on the test dataset (AUC: 0.873, 95% CI: 0.730-0.878). XGBoost achieved a score of 0.814 (95% CI: 0.790-0.878) on the training dataset and 0.805 (95% CI: 0.707-0.829) on the test dataset (AUC: 0.866, 95% CI: 0.780-0.871). Calibration curves indicated good model calibration, and learning curves suggested no significant overfitting.
CONCLUSION: Our findings indicate that ensemble learning methods, particularly RF, outperformed the other models evaluated in predicting progression from HSPC to CRPC. This study represents a preliminary step toward a predictive tool and highlights the potential of baseline clinical data for risk stratification. Prospective studies with larger, multi-center cohorts are warranted to validate and refine this approach for possible clinical integration.
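For readers less familiar with the evaluation metrics named in the METHODS, the following is a minimal sketch of how ACC, PRE, SPE, SEN, F1 score, and AUC can be computed for a binary outcome. It is illustrative only and is not the study's code: the function names are hypothetical, the label coding (1 = progressed to CRPC, 0 = did not progress) is an assumption, and the AUC is computed with the standard pairwise-ranking definition rather than any particular library routine.

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count TP, FP, TN, FN. Assumes binary labels: 1 = progressed, 0 = did not."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1
        elif truth == 0 and pred == 0:
            counts["tn"] += 1
        else:  # truth == 1 and pred == 0
            counts["fn"] += 1
    return counts


def _safe_div(num: float, den: float) -> float:
    """Return 0.0 instead of raising when a denominator is zero."""
    return num / den if den else 0.0


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, specificity, sensitivity, and F1 score."""
    c = confusion_counts(y_true, y_pred)
    tp, fp, tn, fn = c["tp"], c["fp"], c["tn"], c["fn"]
    pre = _safe_div(tp, tp + fp)   # precision: TP / (TP + FP)
    sen = _safe_div(tp, tp + fn)   # sensitivity (recall): TP / (TP + FN)
    spe = _safe_div(tn, tn + fp)   # specificity: TN / (TN + FP)
    acc = _safe_div(tp + tn, tp + fp + tn + fn)
    f1 = _safe_div(2 * pre * sen, pre + sen)
    return {"ACC": acc, "PRE": pre, "SPE": spe, "SEN": sen, "F1": f1}


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs in which the positive
    case receives the higher score; tied pairs count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Calling `classification_metrics(y_true, y_pred)` on the held-out labels and thresholded predictions returns the five metrics as a dictionary, and `roc_auc(y_true, scores)` takes the predicted probabilities rather than the thresholded labels. The confidence intervals reported above are not reproduced by this sketch, since the abstract does not state how they were obtained.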