Abstract
Background: Atrial fibrillation (AF) is common among intensive care unit (ICU) patients and is associated with increased mortality, prolonged length of stay (LOS), and greater resource utilization. Widely used AF risk scores were developed for stable outpatient populations and have limited applicability in critically ill patients. This study aimed to (1) characterize ICU patients with AF, (2) develop machine learning models to predict ICU mortality and ICU LOS and validate them in a temporally distinct external cohort, and (3) identify early clinical factors associated with these outcomes using interpretable methods.
Methods: Adult ICU patients with AF from MIMIC-IV (n = 20,058) were used for model development with grouped cross-validation, and MIMIC-III (n = 11,475) served as a temporal external validation cohort. Predictors were restricted to data available within the first 24 h of ICU admission and included demographics, admission characteristics, vital signs, laboratory values, vasoactive support, and AF-related medications. Eight classification algorithms were evaluated for ICU mortality, and six regression algorithms were evaluated for ICU LOS. Discrimination was assessed primarily using the area under the receiver operating characteristic curve (AUC) and average precision (AP); threshold-dependent metrics were also reported to characterize operating-point behavior under low event prevalence. The probability threshold of the primary mortality model was optimized using out-of-fold predictions. LOS performance was evaluated using mean absolute error (MAE), root mean squared error (RMSE), and the coefficient of determination (R²). Model interpretability was assessed using SHapley Additive exPlanations (SHAP).
Results: The median age was 75 years, and ICU mortality was 8.9%. For mortality prediction, the XGBoost model retained discrimination on temporal external validation in MIMIC-III (AUC = 0.743; AP = 0.226).
At the default probability threshold (0.50), recall and F1 score were low because of the low event prevalence; applying a prespecified F1-optimized threshold derived from the development cohort improved sensitivity without affecting overall discrimination. For ICU LOS, LightGBM performed best on temporal validation but explained little variance (MAE = 88.9 h; RMSE = 163.9 h; R² = 0.038), indicating that structured data from the first 24 h provide insufficient signal to predict ICU LOS accurately, likely because LOS depends on downstream clinical and operational factors. SHAP analysis identified clinically plausible predictors of mortality and prolonged ICU stay, including reduced urine output, renal dysfunction, metabolic derangement, hypoxemia, early vasopressor use, advanced age, and admission pathway.
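The F1-optimized threshold described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 0.01-step threshold grid, and the toy labels and probabilities are all assumptions made for illustration. It shows why a 0.50 cutoff can yield an F1 of zero when events are rare, and how a cutoff chosen on out-of-fold predictions from the development cohort would then be frozen before being applied to the validation cohort.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 score when probabilities >= threshold are classified as events."""
    tp = fp = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted_event = p >= threshold
        if predicted_event and y == 1:
            tp += 1
        elif predicted_event and y == 0:
            fp += 1
        elif not predicted_event and y == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(y_true, y_prob, grid=None):
    """Return (threshold, F1) maximizing F1 over a grid of candidate cutoffs.

    Ties are resolved in favor of the lowest threshold on the grid.
    """
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]  # assumed grid: 0.01 to 0.99
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        f1 = f1_at_threshold(y_true, y_prob, t)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy out-of-fold predictions (hypothetical values, 20% event rate).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.30, 0.25, 0.40]

f1_default = f1_at_threshold(y_true, y_prob, 0.50)
best_t, best_f1 = best_f1_threshold(y_true, y_prob)
print(f"F1 at 0.50: {f1_default:.2f}; best threshold: {best_t:.2f}, F1: {best_f1:.2f}")
```

In this toy example no predicted probability reaches 0.50, so every event is missed at the default cutoff, while a lower cutoff recovers both events at the cost of one false positive. Because the threshold is selected only on development data, applying it unchanged to an external cohort avoids tuning on the validation set.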