Abstract
BACKGROUND: Despite substantial progress in biomarker research, Parkinson's disease (PD) still lacks widely validated, easily deployable diagnostic tests for reliable early-stage detection, particularly in resource-limited settings. OBJECTIVE: This study aimed to develop and externally validate a lightweight machine learning model for first-diagnosis prediction of PD using baseline cerebrospinal fluid (CSF) biomarkers from the Parkinson's Progression Markers Initiative (PPMI). METHODS: Baseline CSF data from 665 participants (PD = 415, controls = 190, SWEDD = 60) were used. Five machine learning classifiers were trained and compared: L2-regularized logistic regression (L2-LR), random forest (RF), histogram-based gradient boosting (HistGB), support vector machine with RBF kernel (SVM-RBF), and multilayer perceptron (MLP). Feature selection focused on five core CSF biomarkers (Aβ42, α-synuclein, total tau, phosphorylated tau181, and hemoglobin). Model performance was evaluated using AUC, PR-AUC, and the Brier score, followed by isotonic calibration and independent validation on the University of Pennsylvania dataset. RESULTS: A lightweight, biomarker-based RF model effectively distinguished first-diagnosis PD cases using a limited set of baseline CSF indicators. Its offline Streamlit deployment offers a practical tool for resource-limited settings, bridging the gap between computational prediction and real-world neurological diagnosis.
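
For readers unfamiliar with the three evaluation metrics named in the methods, the following is a minimal, illustrative sketch of how AUC, PR-AUC (approximated here by average precision), and the Brier score can be computed from predicted probabilities. It is not the study's code: the function names, the toy labels, and the probabilities are hypothetical, and the average-precision helper assumes no tied scores.

```python
def roc_auc(y_true, y_prob):
    """AUC as the chance a random positive outranks a random negative (ties count 0.5)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, y_prob):
    """Mean of the precision values at each positive, ranked by descending score."""
    ranked = sorted(zip(y_prob, y_true), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical toy data: 1 = PD, 0 = control; probabilities are made up.
y_true = [1, 1, 1, 0, 0]
y_prob = [0.9, 0.7, 0.4, 0.5, 0.2]

print(roc_auc(y_true, y_prob))
print(average_precision(y_true, y_prob))
print(brier_score(y_true, y_prob))
```

AUC and PR-AUC depend only on how cases are ranked, whereas the Brier score also penalizes poorly calibrated probabilities, which is why it is the metric most directly affected by a calibration step such as isotonic regression.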