Abstract
STUDY OBJECTIVE: To compare a computable structured opioid use disorder (OUD) phenotype currently used to trigger emergency department (ED) clinical decision support (CDS) with a large language model (LLM) for OUD identification, using expert physician review as the reference standard.

METHODS: We conducted a retrospective study of randomly sampled adult ED encounters (January 1, 2023, to October 17, 2024) at a single academic health system. Encounters were stratified by structured phenotype status and weighted to reflect population prevalence. The structured phenotype, implemented operationally to activate CDS, incorporated diagnosis codes, medications for OUD, urine toxicology results, addiction consultations, and keyword recognition. An LLM (ChatGPT 4.1) analyzed ED notes from the index visit using a zero-shot prompt. Two board-certified emergency physicians independently determined OUD status by full chart review; discrepancies were adjudicated by a third reviewer. We calculated weighted sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV).

RESULTS: Among 302 encounters, weighted OUD prevalence was 5.6% (95% CI 4.0-7.0%). The structured phenotype demonstrated sensitivity 0.84 (95% CI 0.42-0.97) and specificity 0.964 (95% CI 0.96-0.97) (PPV 0.58; NPV 0.99). The LLM demonstrated sensitivity 0.81 (95% CI 0.70-0.88) and specificity 0.996 (95% CI 0.993-0.998) (PPV 0.92; NPV 0.99). Specificity was significantly higher for the LLM (p<0.0001).

CONCLUSION: Both approaches demonstrated strong diagnostic performance. Although the structured phenotype showed slightly higher sensitivity, the LLM achieved higher specificity and PPV, suggesting potential to reduce false-positive alerts in ED workflows. Prospective validation in larger, representative populations is needed to guide clinical implementation.
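
The reported predictive values can be cross-checked against the reported sensitivity, specificity, and weighted prevalence using Bayes' rule. The sketch below is illustrative only and is not the study's analysis code: the function and variable names are hypothetical, and it uses the point estimates alone, ignoring the stratified sampling weights and the confidence intervals.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) implied by sensitivity, specificity, and prevalence."""
    # Expected population fractions in each cell of the 2x2 table.
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Point estimates taken from the RESULTS section of the abstract.
PREVALENCE = 0.056
REPORTED = {
    "structured phenotype": (0.84, 0.964),  # (sensitivity, specificity)
    "LLM": (0.81, 0.996),
}


def summarize():
    """Map each approach to its implied (PPV, NPV), rounded to two decimals."""
    return {
        name: tuple(round(v, 2) for v in predictive_values(sens, spec, PREVALENCE))
        for name, (sens, spec) in REPORTED.items()
    }


if __name__ == "__main__":
    for name, (ppv, npv) in summarize().items():
        print(f"{name}: PPV={ppv:.2f}, NPV={npv:.2f}")
```

With the abstract's point estimates, this yields PPV 0.58 and NPV 0.99 for the structured phenotype and PPV 0.92 and NPV 0.99 for the LLM, consistent with the reported values. It also makes the low-prevalence effect concrete: at 5.6% prevalence, a difference of about three percentage points in specificity accounts for most of the gap in PPV.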