Abstract
BACKGROUND: Early prediction of in-hospital death remains a significant challenge because little structured data is available at initial admission. Unstructured clinical notes, which often contain important observations and impressions, are an underutilized resource for real-time risk stratification. Recent advances in large language models (LLMs) offer a promising way to use this unstructured information, but the poorly understood patient-level uncertainty of LLM predictions is a serious deterrent to their use for such critical forecasts in clinical settings. OBJECTIVE: This study aims to evaluate how effectively, and with what confidence, an LLM (specifically GPT-4o) can predict the probability of in-hospital death for an individual patient from unstructured clinical notes. METHODS: We applied conformal prediction to quantify the uncertainty of GPT-4o's zero-shot predictions of in-hospital death, using concatenated clinical notes documented during the first 24 hours of intensive care unit (ICU) admission in MIMIC-III for patients with acute kidney failure who were admitted through the emergency department (ED). RESULTS: Of the two classes, "in-hospital death" and "in-hospital survive", the GPT model performed better on the in-hospital death class, achieving precision 0.52 (95% CI 0.48-0.56), recall 0.93 (95% CI 0.90-0.95), and F1-score 0.66 (95% CI 0.63-0.70). The conformal prediction (CP) framework provided an overall empirical coverage of 90.4%, exceeding the 90% target. However, class-specific coverage was imbalanced: 99.7% for the death class and 81.1% for the survival class. CONCLUSIONS: The model's outputs exhibit overconfidence, particularly in cases of incorrect predictions. 
Integrating conformal prediction provides a promising approach to quantifying and calibrating uncertainty in large language model outputs for individual patient predictions, thereby enhancing their potential applicability for clinical decision-making.
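For readers unfamiliar with the coverage figures reported above, the following is a minimal sketch of split conformal prediction for a binary classifier. The abstract does not state which conformal variant or nonconformity score was used, so everything here is an assumption: the score is taken as one minus the probability assigned to the true class, labels are coded 0 = survive and 1 = death, the function names are hypothetical, and the probabilities are synthetic rather than GPT-4o outputs or MIMIC-III data.

```python
import math
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Return the calibration threshold for a target coverage of 1 - alpha.

    cal_probs: (n, 2) array of predicted class probabilities.
    cal_labels: (n,) integer array, assumed 0 = survive, 1 = death.
    """
    n = len(cal_labels)
    # Nonconformity score (assumed): 1 - probability of the true class.
    scores = np.sort(1.0 - cal_probs[np.arange(n), cal_labels])
    # Finite-sample rank: the ceil((n + 1)(1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: include everything
    return float(scores[k - 1])

def prediction_sets(test_probs, qhat):
    """Boolean (m, 2) mask: class k is in the set if 1 - p_k <= qhat."""
    return (1.0 - test_probs) <= qhat

def coverage(sets, labels):
    """Fraction of cases whose true label is inside the prediction set."""
    return float(sets[np.arange(len(labels)), labels].mean())

def coverage_by_class(sets, labels):
    """Empirical coverage computed separately for each true class."""
    return {int(k): coverage(sets[labels == k], labels[labels == k])
            for k in np.unique(labels)}

def synthetic_probs(n, rng):
    """Synthetic stand-in for model outputs (not study data)."""
    labels = rng.integers(0, 2, size=n)
    p_death = 0.5 + 0.25 * (2 * labels - 1) + rng.normal(0.0, 0.25, size=n)
    p_death = np.clip(p_death, 0.01, 0.99)
    return np.column_stack([1.0 - p_death, p_death]), labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cal_probs, cal_labels = synthetic_probs(1000, rng)
    test_probs, test_labels = synthetic_probs(1000, rng)
    qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.1)
    sets = prediction_sets(test_probs, qhat)
    print("overall coverage:", coverage(sets, test_labels))
    print("per-class coverage:", coverage_by_class(sets, test_labels))
```

A single threshold of this kind guarantees only marginal (overall) coverage, which is consistent with overall coverage near the target alongside unequal per-class coverage. Calibrating a separate threshold per class is one common way to target class-conditional coverage, though whether that would suit this study is not something the abstract addresses.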