Abstract
BACKGROUND: This study explores whether readily available complete blood count (CBC) indicators, combined with machine learning algorithms, can be used to build a predictive model for mental disorders.
METHODS: In September 2024, 1,379 university volunteers were recruited, and data on age, gender, and 22 CBC variables were collected. The dependent variable was a binary outcome assigned by the university's mental health evaluation system on the basis of the Symptom Checklist-90 (SCL-90): a positive group with mental disorders and a negative group without. SMOTETomek hybrid sampling was applied to address class imbalance, and Random Forest (RF) was used for feature selection. Four machine learning models were then constructed and compared: eXtreme Gradient Boosting (XGBoost), Adaptive Boosting (AdaBoost), RF, and Gradient Boosting Decision Tree (GBDT). Model performance was evaluated using the area under the receiver operating characteristic curve (AUC), F1-score, accuracy, sensitivity, and specificity. The SHapley Additive exPlanations (SHAP) method was used to interpret the best-performing model, and a logistic regression (LR) model was used to build a nomogram.
RESULTS: Of the 1,379 volunteers, 1,023 were classified as negative and 356 as positive. Fifteen volunteers had missing data for four indicators. RF-based feature selection identified 14 variables for model construction. Among the models compared, XGBoost performed best, with the highest AUC: 0.860 on the training set and 0.827 on the testing set. Both the SHAP analysis of the XGBoost model and the nomogram identified Basophil Percentage (BASO%), Basophil Count (BASO#), and Mean Corpuscular Hemoglobin (MCH) as the top three contributing features.
CONCLUSION: This study developed an XGBoost-based model that predicts mental disorders from complete blood count data. The model provides clinicians with objective risk-assessment indicators that can assist diagnosis and improve its efficiency and accuracy.
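
As a rough illustration of why the evaluation reports sensitivity, specificity, and F1-score alongside accuracy, the sketch below computes these metrics from a confusion matrix. It is not the study's code: the `Confusion` class, the `metrics` helper, and the "always predict negative" baseline are hypothetical and introduced only for illustration. The only values taken from the abstract are the class counts (1,023 negative, 356 positive).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts of a binary confusion matrix (hypothetical helper type)."""
    tp: int  # positives predicted positive
    fp: int  # negatives predicted positive
    tn: int  # negatives predicted negative
    fn: int  # positives predicted negative


def safe_div(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def metrics(c: Confusion) -> dict:
    """Accuracy, sensitivity, specificity, and F1 from confusion counts."""
    sensitivity = safe_div(c.tp, c.tp + c.fn)  # recall on the positive class
    specificity = safe_div(c.tn, c.tn + c.fp)  # recall on the negative class
    precision = safe_div(c.tp, c.tp + c.fp)
    accuracy = safe_div(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn)
    f1 = safe_div(2 * precision * sensitivity, precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Class counts reported in the abstract.
N_NEGATIVE, N_POSITIVE = 1023, 356

# Hypothetical baseline (an assumption, not a model from the study):
# a classifier that labels every volunteer as negative.
baseline = Confusion(tp=0, fp=0, tn=N_NEGATIVE, fn=N_POSITIVE)
baseline_metrics = metrics(baseline)

for name, value in baseline_metrics.items():
    print(f"{name}: {value:.3f}")
```

With these class counts, the trivial baseline reaches roughly 74% accuracy while detecting none of the positive cases, which is why accuracy alone is a weak summary for an imbalanced outcome like this one.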