Abstract
BACKGROUND: Prostate cancer, benign prostatic hyperplasia, and prostatitis overlap substantially in clinical symptoms and biological characteristics, which hampers early, non-invasive differential diagnosis. Untargeted metabolomics enables comprehensive profiling of disease-associated metabolic alterations; however, its high dimensionality and strong feature correlations challenge conventional statistical approaches. METHODS: We analyzed serum untargeted LC-MS data after standardized preprocessing. Using a nested cross-validation strategy, we evaluated several feature selection methods and machine learning classifiers, and identified multiclass LASSO regression as the most effective feature selection approach. RESULTS: An optimized Random Forest model showed superior performance in distinguishing among prostate cancer, prostatitis, benign prostatic hyperplasia, and healthy controls (out-of-fold accuracy: 93.8%; macro-F1: 0.937). In addition, SHAP (SHapley Additive exPlanations) analysis mapped statistical feature importance onto biologically meaningful modules, revealing that disease-specific patterns of metabolic reprogramming drove the model's robust multiclass discrimination. CONCLUSIONS: This study demonstrates the value of integrating serum untargeted metabolomics with explainable machine learning for multiclass differentiation of major prostate diseases, providing a promising non-invasive framework for diagnostic stratification and metabolic biomarker discovery.
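To make the evaluation design in METHODS concrete, the following is a minimal sketch of nested cross-validation in which feature selection and classifier tuning occur only inside the training folds, and out-of-fold predictions are scored by accuracy and macro-F1. It is not the study's actual pipeline. Everything here is an assumption for illustration: the data are synthetic (generated with scikit-learn's `make_classification`), L1-penalized multinomial logistic regression stands in for "multiclass LASSO regression", and the fold counts, hyperparameter grid, and variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: synthetic stand-in for a samples x metabolite-features matrix
# with four classes (e.g. three disease groups plus healthy controls).
X, y = make_classification(
    n_samples=200, n_features=60, n_informative=10, n_redundant=10,
    n_classes=4, n_clusters_per_class=1, class_sep=2.0, random_state=0,
)

# Scaling, L1-based feature selection, and the classifier live in one
# pipeline so that each step is refit on training folds only (no leakage).
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectFromModel(
        # ASSUMPTION: L1-penalized logistic regression as the LASSO-style selector.
        LogisticRegression(penalty="l1", solver="saga", C=1.0, max_iter=5000)
    )),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# ASSUMPTION: an illustrative grid, not the grid used in the study.
param_grid = {
    "select__estimator__C": [0.1, 1.0],  # strength of the L1 penalty
    "clf__max_depth": [None, 5],         # Random Forest tree depth
}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: hyperparameter search on each outer training split.
search = GridSearchCV(pipeline, param_grid, cv=inner_cv, scoring="f1_macro")

# Outer loop: every sample is predicted by a model that never saw it.
oof_pred = cross_val_predict(search, X, y, cv=outer_cv)

acc = accuracy_score(y, oof_pred)
macro_f1 = f1_score(y, oof_pred, average="macro")
print(f"out-of-fold accuracy: {acc:.3f}, macro-F1: {macro_f1:.3f}")
```

Wrapping the selector inside the pipeline is the design choice that matters: selecting features on the full dataset before splitting would leak label information into the held-out folds and inflate the out-of-fold metrics.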