Abstract
OBJECTIVE: This study aimed to develop and validate a machine learning (ML)-based model for predicting the risk of gestational diabetes mellitus (GDM) from 45 dietary nutrients and baseline data.

METHODS: A retrospective analysis was conducted on 3,649 pregnant women from the NHANES database (2007–2018). Baseline data (age, race/ethnicity, BMI, etc.) and intake of 45 dietary nutrients were collected. The Synthetic Minority Oversampling Technique (SMOTE) was applied after the train-test split to address class imbalance. For feature selection, the Variance Inflation Factor (VIF) was used to reduce multicollinearity and the Boruta algorithm to identify core predictors. Six ML models were trained: XGBoost, LightGBM, random forest (RF), support vector machine (SVM), Gaussian naive Bayes (GNB), and k-nearest neighbors (KNN). Performance was evaluated by AUC, accuracy, sensitivity, specificity, F-Beta score (β = 2), and PR-AUC. SHAP analysis was used to quantify feature importance.

RESULTS: Core predictors included race/ethnicity, BMI, protein, dietary fiber, α-carotene, β-carotene, lutein/zeaxanthin, folate (DFE), calcium, phosphorus, zinc, potassium, alcohol intake, educational level, and smoking status. XGBoost performed best in the validation set (accuracy: 93.1%, F-Beta: 0.943, AUC: 0.966, sensitivity: 97.5%, specificity: 86.7%, PR-AUC: 0.967), followed by LightGBM (accuracy: 92.6%) and RF (accuracy: 91.4%). GNB performed worst (accuracy: 57.3%, AUC: 0.658). SHAP identified educational level, race/ethnicity, lycopene, and smoking status as the top contributors.

CONCLUSION: ML models integrating demographics and 45 dietary nutrients accurately predict GDM. XGBoost, LightGBM, and RF performed well, with XGBoost the most effective, supporting early GDM detection in clinical practice.

SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13040-025-00515-z.
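The threshold-based metrics named in the methods (accuracy, sensitivity, specificity, and the F-Beta score with β = 2) all derive from the four cells of a binary confusion matrix. The following is a minimal sketch of those standard definitions, not the authors' code. The class `ConfusionCounts`, the function names, and the example counts are hypothetical and are not taken from the study's data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts, with GDM as the positive class."""
    tp: int  # GDM cases predicted as GDM
    fp: int  # non-GDM cases predicted as GDM
    tn: int  # non-GDM cases predicted as non-GDM
    fn: int  # GDM cases predicted as non-GDM


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def sensitivity(c: ConfusionCounts) -> float:
    """Recall on the positive class: TP / (TP + FN)."""
    return _ratio(c.tp, c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    """Recall on the negative class: TN / (TN + FP)."""
    return _ratio(c.tn, c.tn + c.fp)


def accuracy(c: ConfusionCounts) -> float:
    """Share of all predictions that are correct."""
    return _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn)


def f_beta(c: ConfusionCounts, beta: float = 2.0) -> float:
    """F-Beta = (1 + b^2) * TP / ((1 + b^2) * TP + b^2 * FN + FP).

    With beta = 2, false negatives are penalized four times as
    heavily as false positives, so recall dominates the score.
    """
    b2 = beta * beta
    return _ratio((1 + b2) * c.tp, (1 + b2) * c.tp + b2 * c.fn + c.fp)


# Hypothetical counts for illustration only (not the study's results).
counts = ConfusionCounts(tp=80, fp=10, tn=90, fn=20)

if __name__ == "__main__":
    print(f"accuracy:    {accuracy(counts):.3f}")
    print(f"sensitivity: {sensitivity(counts):.3f}")
    print(f"specificity: {specificity(counts):.3f}")
    print(f"F-Beta (2):  {f_beta(counts, beta=2.0):.3f}")
```

Choosing β = 2 weights sensitivity above precision, which fits a screening setting where a missed GDM case is usually considered costlier than a false alarm.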