Abstract
This study synthesizes current evidence on artificial intelligence-based sepsis prediction models for emergency department patients and proposes practical benchmarks that emphasize standardized data preparation and reproducible model characterization. Literature searches were conducted in Scopus, Web of Science, PubMed, MEDLINE, and Embase. Eligible studies were selected through a two-tiered screening process, followed by data extraction and quality assessment against predefined criteria. Random-effects meta-analysis was used to quantify model performance, and heterogeneity was explored through subgroup, meta-regression, and sensitivity analyses. In total, 36 studies comprising 98 predictive models were included, yielding a pooled area under the receiver operating characteristic curve of 0.87 (95% CI: 0.86–0.88). Performance differences were associated with study-level methodological factors, including target definition, data provenance, cohort scale, data preprocessing, feature representation, and model development. Integrated meta-regression further identified methodological factors that independently influenced model performance. Artificial intelligence-based models showed higher pooled predictive performance than widely used traditional scoring systems for sepsis in emergency departments. However, translation into practice remains limited by inconsistent evaluation and reporting and by inadequate external validation. Standardized methodological benchmarks have the potential to improve reproducibility, comparability, and clinical applicability.

SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s10916-026-02376-3.