Abstract
BACKGROUND: Coronary heart disease (CHD), the most common form of heart disease, progresses over years before culminating in serious cardiac events. Early prediction and intervention are critical to reducing CHD-related morbidity, mortality, and healthcare burden.
OBJECTIVE: To develop and validate a machine learning model that uses statewide electronic health records (EHRs) to predict the 1-year risk of CHD in the general population of Maine, enabling targeted preventive strategies.
METHODS: Two population-based cohorts were constructed from the Maine Health Information Exchange (HIE): a retrospective cohort for model training and calibration (2015–2017, N = 1,042,124) and a prospective cohort for external validation (2016–2018, N = 1,040,158). EHR features included demographics, diagnoses, procedures, medications, laboratory results, and utilization metrics. The final model was built with a multistage pipeline consisting of statistical filtering, XGBoost-based feature selection, risk prediction, and isotonic regression calibration. Validation covered discrimination, calibration, and survival analysis.
RESULTS: The final XGBoost model achieved strong discrimination, with an AUC of 0.952 (95% CI: 0.950–0.954) in the retrospective cohort and 0.888 (95% CI: 0.885–0.890) in the prospective cohort. Based on calibrated risk probabilities, the prospective cohort was stratified into five risk categories: very low (92.30%, N = 960,021), low (6.79%, N = 70,676), medium (0.85%, N = 8,888), high (0.05%, N = 554), and very high (0.002%, N = 19). In the very-high-risk group, 11 of 19 individuals (57.89%) developed CHD within 1 year.
CONCLUSIONS: This statewide, HIE-based CHD risk prediction model retained strong discrimination in prospective validation, supporting its real-world applicability. It enables early identification of high-risk individuals and supports population-scale precision prevention through evidence-informed, proactive care.
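The calibration and stratification steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it fits a plain pool-adjacent-violators step function (the standard algorithm behind isotonic regression) on toy scores, and the names `fit_isotonic`, `calibrate`, `assign_tier`, and the tier cutoffs in `CUTOFFS` are all hypothetical, since the abstract does not report the thresholds used.

```python
from bisect import bisect_left, bisect_right

TIERS = ["very low", "low", "medium", "high", "very high"]
# Hypothetical cutoffs on calibrated 1-year risk; the abstract does not report the real ones.
CUTOFFS = [0.05, 0.15, 0.30, 0.50]


def fit_isotonic(scores, outcomes):
    """Fit a non-decreasing step function to binary outcomes (pool-adjacent-violators).

    Returns (uppers, means): block i covers scores up to uppers[i] and
    predicts the observed event rate means[i].
    """
    blocks = []  # each block: [event_sum, count, highest_score_in_block]
    for score, outcome in sorted(zip(scores, outcomes)):
        if blocks and blocks[-1][2] == score:
            # Tied scores must share one calibrated value.
            blocks[-1][0] += outcome
            blocks[-1][1] += 1
        else:
            blocks.append([float(outcome), 1, score])
        # Pool adjacent blocks while the earlier one has the higher event rate.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            event_sum, count, upper = blocks.pop()
            blocks[-1][0] += event_sum
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [b[2] for b in blocks], [b[0] / b[1] for b in blocks]


def calibrate(score, uppers, means):
    """Map a raw model score to the event rate of the block that contains it."""
    idx = min(bisect_left(uppers, score), len(means) - 1)
    return means[idx]


def assign_tier(probability, cutoffs=CUTOFFS):
    """Place a calibrated probability into one of the five risk tiers."""
    return TIERS[bisect_right(cutoffs, probability)]


# Toy data: raw model scores and whether CHD was observed within 1 year.
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
observed = [0, 0, 1, 0, 1, 1]

uppers, means = fit_isotonic(raw_scores, observed)
risk = calibrate(0.35, uppers, means)
tier = assign_tier(risk)
```

On the toy data, the out-of-order outcomes at scores 0.3 and 0.4 are pooled into a single block with an event rate of 0.5, which is what makes the calibrated curve monotone. In practice a library routine such as scikit-learn's `IsotonicRegression` would more likely be used. It interpolates between blocks rather than returning a pure step function.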
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12889-025-24266-y.