Abstract
Machine learning and deep learning tools have been proposed to improve survival prediction in acute myeloid leukemia (AML), but their comparative performance remains unclear. PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) 2020-guided searches of PubMed, Scopus, and Web of Science (January 2018 to March 2025) identified studies that developed or externally validated artificial intelligence (AI)-based models for overall or relapse-free survival and reported the area under the receiver operating characteristic (ROC) curve (AUC). Two reviewers extracted design, population, features, algorithms, and training/validation AUCs and assessed risk of bias using the Prediction model Risk of Bias Assessment Tool (PROBAST). Random-effects meta-analysis (DerSimonian-Laird) pooled validation AUCs overall and by prediction horizon (1/2/3/5 years) and feature category (gene-centric vs nongenetic). Optimism bias was defined as the difference between training and validation AUCs. We included 24 predominantly retrospective studies (137 model cohorts; ∼51 055 patients). Of 120 PROBAST domain ratings, 74% were low risk, 25% unclear, and <1% high; statistical analysis was the weakest domain. Across 73 independent validation cohorts, the pooled AUC was 0.769 (95% confidence interval [CI], 0.742-0.795) with substantial between-study variability (I² = 95.7%, meaning that most of the spread reflects real differences across cohorts rather than chance). Validation AUCs were higher at longer horizons (1-year, 0.748; 2-year, 0.760; 3-year, 0.760; 5-year, 0.833). The pooled development AUC was 0.801 vs 0.749 in matched validation sets (ΔAUC, 0.052; 95% CI, 0.041-0.063). Nongenetic models achieved a pooled validation AUC of 0.776 vs 0.741 for gene-centric models (ΔAUC, 0.035; P = .085). AI-based prognostic models for AML show moderate discrimination with modest optimism but substantial heterogeneity and limited prospective validation, which supports standardized reporting and rigorous external evaluation.
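For readers unfamiliar with the pooling step, the following is a minimal sketch of DerSimonian-Laird random-effects pooling and the I² statistic. It is illustrative only and is not the analysis code of this study: the function name `dersimonian_laird`, the example AUCs, and their standard errors are hypothetical, and pooling on the raw AUC scale (rather than, say, a logit scale) is an assumption, since the abstract does not specify the scale.

```python
import math


def dersimonian_laird(estimates, std_errors):
    """Pool per-cohort estimates with a DerSimonian-Laird random-effects model.

    estimates  : per-cohort effect estimates (here, AUCs on the raw scale)
    std_errors : their standard errors (assumed known for each cohort)
    """
    k = len(estimates)
    if k < 2 or k != len(std_errors):
        raise ValueError("need at least two estimates with matching standard errors")

    variances = [se ** 2 for se in std_errors]

    # Fixed-effect (inverse-variance) weights and pooled estimate.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum_w

    # Cochran's Q measures the observed spread around the fixed-effect estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    df = k - 1

    # Method-of-moments estimate of the between-cohort variance, truncated at 0.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)

    # I^2: percentage of total variability attributed to real heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # Random-effects weights add tau^2 to each within-cohort variance.
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sum(w_re)
    se_pooled = math.sqrt(1.0 / sum(w_re))

    return {
        "pooled": pooled,
        "ci_low": pooled - 1.96 * se_pooled,
        "ci_high": pooled + 1.96 * se_pooled,
        "tau2": tau2,
        "i2": i2,
    }


if __name__ == "__main__":
    # Hypothetical validation AUCs and standard errors, not data from the review.
    aucs = [0.72, 0.81, 0.77, 0.69]
    ses = [0.020, 0.030, 0.025, 0.040]
    result = dersimonian_laird(aucs, ses)
    print({name: round(value, 4) for name, value in result.items()})
```

Under this model, a large I² widens the confidence interval around the pooled AUC because the between-cohort variance is added to each cohort's sampling variance before weighting.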