Abstract
BACKGROUND: Several studies have used electronic health records (EHRs) to build machine learning models that predict the likelihood of developing gestational diabetes mellitus (GDM) later in pregnancy, but none have described validation of the GDM "label" within the EHRs.

OBJECTIVE: This study examines the accuracy of GDM diagnoses in EHRs compared with a clinical team database (CTD) and the impact of label discrepancies on machine learning models.

METHODS: EHR diagnoses from 2018 to 2022 were validated against CTD data to identify true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Logistic regression models were trained and tested using both EHR and validated labels, after which simulated label noise was introduced to increase FP and FN rates. Model performance was assessed using the area under the receiver operating characteristic curve (ROC AUC) and average precision (AP).

RESULTS: Among the 3952 patients with a GDM diagnosis in the CTD, 3388 (85.7%) were correctly identified with GDM in both databases, while 564 lacked a GDM label in the EHRs; a further 771 patients carried an EHR GDM label with no corresponding CTD diagnosis. Of all cases, 32,928 (87.5%) were TN, 3388 (9.0%) TP, 771 (2.0%) FP, and 564 (1.5%) FN. The model trained and tested with validated labels achieved an ROC AUC of 0.817 and an AP of 0.450, whereas the same model tested using EHR labels achieved 0.814 and 0.395, respectively. Increasing label noise during training led to gradual declines in ROC AUC and AP, whereas noise in the test set, especially elevated FP rates, resulted in marked performance drops.

CONCLUSIONS: Discrepancies between EHR and CTD diagnoses had a limited impact on model training but markedly affected performance evaluation when present in the test set, emphasizing the importance of accurate data validation.
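The test-set label-noise experiment described in METHODS can be sketched as follows. This is a minimal illustration on synthetic data, not the study's actual pipeline: the function names, noise rates, prevalence, and score distributions are all assumptions. It flips a fraction of test-set negatives to positives (simulated FP) and positives to negatives (simulated FN), then re-scores the same predictions to show how ROC AUC and AP degrade when the evaluation labels themselves are noisy.

```python
# Hedged sketch: effect of test-set label noise on ROC AUC and AP.
# All data below is synthetic; nothing here reproduces the study's numbers.
import random

def roc_auc(labels, scores):
    """ROC AUC via the Mann-Whitney rank formulation."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """AP = mean of precision evaluated at the rank of each positive."""
    order = sorted(range(len(labels)), key=lambda i: -scores[i])
    tp, prec_sum = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            tp += 1
            prec_sum += tp / rank
    return prec_sum / tp

def add_label_noise(labels, fp_rate, fn_rate, rng):
    """Flip negatives to positives (FP) and positives to negatives (FN)."""
    noisy = []
    for y in labels:
        if y == 0 and rng.random() < fp_rate:
            noisy.append(1)          # simulated false positive
        elif y == 1 and rng.random() < fn_rate:
            noisy.append(0)          # simulated false negative
        else:
            noisy.append(y)
    return noisy

rng = random.Random(0)
# Synthetic cohort: ~10% prevalence; positives score higher on average.
labels = [1] * 200 + [0] * 1800
scores = [rng.gauss(1.0, 1.0) if y else rng.gauss(0.0, 1.0) for y in labels]

clean_auc, clean_ap = roc_auc(labels, scores), average_precision(labels, scores)
noisy = add_label_noise(labels, fp_rate=0.05, fn_rate=0.05, rng=rng)
noisy_auc, noisy_ap = roc_auc(noisy, scores), average_precision(noisy, scores)

print(f"AUC vs clean labels: {clean_auc:.3f}, AP: {clean_ap:.3f}")
print(f"AUC vs noisy labels: {noisy_auc:.3f}, AP: {noisy_ap:.3f}")
```

Because the model's predictions are held fixed while only the evaluation labels are corrupted, any metric change is purely an artifact of label noise, mirroring the study's observation that unvalidated EHR labels in the test set understate true model performance.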