Abstract
BACKGROUND: Early identification of noninvasive respiratory support (NIRS) failure in acute respiratory failure (ARF) is clinically relevant, as delayed intubation is associated with worse outcomes. Machine learning-based prediction models have been proposed to support escalation decisions, but their performance and reliability remain uncertain.

OBJECTIVE: To systematically evaluate the discriminative performance of machine learning-based models for predicting NIRS failure in adults with ARF.

METHODS: We conducted a systematic review and meta-analysis following PRISMA 2020 guidelines and registered the protocol in PROSPERO (CRD420251167330). PubMed, Web of Science, and Scopus were searched from January 2010 to the final search date. Cohort studies developing or validating machine learning models to predict NIRS failure, primarily defined as endotracheal intubation, were included. Discrimination was assessed using the area under the receiver operating characteristic curve (AUC). Logit-transformed AUCs were synthesized using random-effects models with restricted maximum likelihood estimation and Hartung-Knapp confidence intervals. Risk of bias and certainty of evidence were assessed using PROBAST-AI and GRADE, respectively.

RESULTS: Fourteen cohort studies comprising 34,500 patients were included. The descriptive pooled AUC was 0.84 (95% CI, 0.78-0.89) with extreme heterogeneity (I² = 99.5%) and wide prediction intervals. Subgroup analyses showed no statistically significant differences by validation strategy or type of noninvasive respiratory support. All studies were rated at high risk of bias, and the certainty of evidence was very low.

CONCLUSION: Machine learning-based models demonstrate moderate discrimination; however, extreme heterogeneity, high risk of bias, and very low certainty of evidence preclude clinical implementation.

SYSTEMATIC REVIEW REGISTRATION: https://www.crd.york.ac.uk/PROSPERO/view/CRD420251167330.
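
The pooling approach named in the Methods (logit-transformed AUCs, a random-effects model fitted by restricted maximum likelihood, and Hartung-Knapp confidence intervals) can be illustrated with a minimal sketch. This is not the authors' code. The function name `pool_logit_auc`, the delta-method standard errors, the fixed-point REML iteration, and the input values are all assumptions made for illustration; the AUCs and standard errors below are hypothetical and are not taken from the included studies. The t critical value is passed in by the caller so the sketch needs only the standard library.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def pool_logit_auc(aucs, ses, t_crit, max_iter=200, tol=1e-10):
    """Random-effects pooling of AUCs on the logit scale (illustrative sketch).

    aucs   : per-study AUCs, each strictly between 0 and 1
    ses    : per-study standard errors on the AUC scale
    t_crit : two-sided t critical value with k - 1 degrees of freedom
    Returns (pooled AUC, CI lower, CI upper, tau^2 on the logit scale).
    """
    k = len(aucs)
    y = [logit(a) for a in aucs]
    # Delta method (an assumption here): Var(logit AUC) ~ SE^2 / (AUC * (1 - AUC))^2
    v = [(s / (a * (1.0 - a))) ** 2 for a, s in zip(aucs, ses)]

    # Fixed-point iteration for the REML estimate of tau^2, truncated at zero.
    tau2 = 0.0
    for _ in range(max_iter):
        w = [1.0 / (vi + tau2) for vi in v]
        sw = sum(w)
        mu = sum(wi * yi for wi, yi in zip(w, y)) / sw
        num = sum(wi ** 2 * ((yi - mu) ** 2 - vi) for wi, yi, vi in zip(w, y, v))
        new_tau2 = max(0.0, num / sum(wi ** 2 for wi in w) + 1.0 / sw)
        converged = abs(new_tau2 - tau2) < tol
        tau2 = new_tau2
        if converged:
            break

    # Final weights and pooled estimate on the logit scale.
    w = [1.0 / (vi + tau2) for vi in v]
    sw = sum(w)
    mu = sum(wi * yi for wi, yi in zip(w, y)) / sw

    # Hartung-Knapp variance: weighted residual mean square divided by total weight.
    q = sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, y)) / (k - 1)
    se_hk = math.sqrt(q / sw)
    lower, upper = mu - t_crit * se_hk, mu + t_crit * se_hk

    # Back-transform to the AUC scale.
    return inv_logit(mu), inv_logit(lower), inv_logit(upper), tau2


# Hypothetical inputs for five studies (not data from the review).
example_aucs = [0.72, 0.80, 0.85, 0.90, 0.93]
example_ses = [0.03, 0.02, 0.025, 0.015, 0.01]
# 2.776 is the 97.5th percentile of the t distribution with 4 degrees of freedom.
pooled, ci_low, ci_high, tau2 = pool_logit_auc(example_aucs, example_ses, t_crit=2.776)
print(f"pooled AUC {pooled:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f}), tau^2 {tau2:.3f}")
```

Working on the logit scale keeps the back-transformed interval inside (0, 1), and the Hartung-Knapp adjustment typically widens the interval relative to a normal-approximation interval when studies are few or heterogeneous. In practice an established meta-analysis package would normally be used rather than a hand-written routine.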