Abstract
Chronic‑disease risk models built on electronic health record (EHR) data inform screening and resource allocation. Their clinical value is governed by calibration (expected calibration error [ECE], slope, and intercept), transportability under temporal or site shifts, and decision utility (net benefit). A narrative synthesis of comparative studies published between January 2019 and October 8, 2025 appraised classical regression and gradient‑boosted decision tree (GBDT) models against deep neural networks (DNNs) and foundation‑model backbones. The evidence indicated that modern tree‑based methods often achieved lower Brier scores and external calibration errors than logistic regression, although logistic regression retained a calibration slope close to 1 under temporal drift in several datasets. DNNs frequently underestimated risk in high‑risk deciles, whereas models derived from foundation backbones improved calibration and decision utility only after local recalibration and were most efficient when labels were scarce. Across tasks, decision curves showed that net benefit increased only when recalibration maintained ECE ≤0.03. Operationally, acceptance criteria should couple a calibration‑slope target of 0.90-1.10 with pre‑specified threshold performance and monitoring schedules.
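The metrics named above have standard definitions that can be computed directly from predicted risks and observed outcomes. The following sketch is illustrative only (function names, the equal-width binning choice for ECE, and the Newton-Raphson fit are implementation assumptions, not taken from the reviewed studies): ECE as the bin-weighted mean absolute gap between observed and predicted risk, calibration slope and intercept from a logistic regression of outcomes on the logit of predicted risk, and net benefit at a decision threshold t as TP/n - FP/n * t/(1-t).

```python
import numpy as np

def ece(y, p, n_bins=10):
    """Expected calibration error over equal-width probability bins:
    bin-weighted mean |observed event rate - mean predicted risk|."""
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        m = bins == b
        if m.any():
            total += m.mean() * abs(y[m].mean() - p[m].mean())
    return total

def calibration_slope_intercept(y, p, iters=50):
    """Fit y ~ intercept + slope * logit(p) by Newton-Raphson
    logistic regression; slope near 1 and intercept near 0
    indicate good calibration-in-the-large and spread."""
    x = np.log(p / (1 - p))                      # logit of predicted risk
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        W = mu * (1.0 - mu)                      # IRLS weights
        grad = X.T @ (y - mu)
        H = X.T @ (X * W[:, None])               # Fisher information
        beta += np.linalg.solve(H, grad)
    return beta[1], beta[0]                      # (slope, intercept)

def net_benefit(y, p, t):
    """Net benefit at threshold t: TP/n - FP/n * t/(1-t),
    the quantity plotted on a decision curve."""
    pred = p >= t
    n = len(y)
    tp = np.sum(pred & (y == 1)) / n
    fp = np.sum(pred & (y == 0)) / n
    return tp - fp * t / (1.0 - t)
```

In practice these would be evaluated on a held-out or temporally shifted cohort, with recalibration (e.g., refitting the intercept and slope) applied before re-checking ECE against an acceptance criterion such as the ≤0.03 figure cited above.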