Abstract
Prediabetes can progress to type 2 diabetes (T2D), but individual risk varies widely. Few studies have rigorously characterized subgroups at the point of prediabetes (PD) onset. Using electronic health records (EHRs), we developed a machine learning approach to stratify PD and analyze T2D progression risk. We defined PD onset based on strict HbA1c criteria and excluded patients with missing follow-ups or atypical clinical events, yielding a high-fidelity cohort of 14,436 patients from an initial pool of 74,054 (2017-2023, MedStar Health). An XGBoost model using routine features, including HbA1c, BMI, blood pressure, lipids, ALT, medication history, and lifestyle factors, was trained on 2018-2020 data and tested on 2021-2022 patients, achieving an AUC of 81.6%. Risk scores enabled subtyping into high-, medium-, and low-risk groups with distinct progression trajectories. Stratification patterns remained consistent in future cohorts. This approach supports earlier, personalized intervention and diabetes risk prediction using real-world EHR data.