Abstract
BACKGROUND: Sociodemographic factors influence prostate cancer (PCa) outcomes but are rarely incorporated into clinical risk prediction models. This study aimed to assess whether machine learning approaches could integrate sociodemographic variables to improve the prediction of cancer-specific survival among patients with high-risk PCa.
MATERIALS AND METHODS: Data from the Surveillance, Epidemiology, and End Results (SEER) database were retrospectively analyzed to identify patients diagnosed with high-risk PCa from 2010 to 2020. Two random forest models were developed: one using clinical and pathological variables (age, stage, prostate-specific antigen level, Gleason grade, time to treatment, and year of diagnosis) and a combined model that added the available sociodemographic features (race, income, marital status, region, and urbanicity). Five-fold cross-validation was used to evaluate model performance and limit overfitting, and hyperparameters were tuned via grid search. Performance was assessed using the area under the receiver operating characteristic curve (AUC), Brier score, sensitivity, and specificity. Parallel analyses were conducted using XGBoost. Clinical utility was evaluated using decision curve analysis.
RESULTS: We identified 80,858 patients with high-risk PCa. Adding sociodemographic variables significantly improved discrimination over the clinical-only random forest model (AUC, 0.72 vs. 0.54; p < 0.001). The Brier score, sensitivity, and specificity were also superior in the combined model (all p < 0.001). Similar results were obtained with XGBoost. Gleason grade was the most predictive factor, and sociodemographic variables, particularly income and geographic region, were also highly informative. Decision curve analysis demonstrated a higher net clinical benefit with the combined model.
CONCLUSIONS: Incorporating sociodemographic variables into machine learning models significantly improved the prediction of cancer-specific survival in high-risk PCa, supporting their inclusion in risk stratification tools.
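
For readers unfamiliar with two of the metrics named above, the following is a minimal sketch of the standard definitions of the Brier score and of the net benefit used in decision curve analysis. It is not the authors' code. The function names, the toy outcomes, and the predicted risks are hypothetical and chosen only for illustration; they do not reproduce any result from the study.

```python
from typing import Sequence


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted risk and observed outcome.

    Lower is better; 0 means perfect probabilistic predictions.
    """
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def net_benefit(y_true: Sequence[int], y_prob: Sequence[float],
                threshold: float) -> float:
    """Net benefit of acting on patients whose predicted risk >= threshold.

    Standard decision-curve definition (threshold must lie in (0, 1)):
        TP/n - FP/n * threshold / (1 - threshold)
    """
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(y_true: Sequence[int], threshold: float) -> float:
    """Reference strategy: treat every patient regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical toy data: 1 = cancer-specific death, 0 = no event.
y = [1, 0, 1, 0, 0, 1, 0, 0]
# Hypothetical predicted risks from a weaker and a stronger model.
clinical_only = [0.5, 0.4, 0.4, 0.5, 0.3, 0.5, 0.4, 0.6]
combined = [0.8, 0.2, 0.7, 0.3, 0.1, 0.6, 0.4, 0.2]

print(f"Brier (clinical only): {brier_score(y, clinical_only):.3f}")
print(f"Brier (combined):      {brier_score(y, combined):.3f}")

# A decision curve plots net benefit across a range of threshold probabilities.
for pt in (0.2, 0.3, 0.5):
    print(
        f"threshold={pt:.1f}  "
        f"clinical={net_benefit(y, clinical_only, pt):+.3f}  "
        f"combined={net_benefit(y, combined, pt):+.3f}  "
        f"treat-all={net_benefit_treat_all(y, pt):+.3f}"
    )
```

In a decision curve, the model whose net benefit lies above both the competing model and the treat-all and treat-none (net benefit of zero) references over the clinically relevant threshold range is the one considered to offer greater clinical utility.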