Abstract
BACKGROUND AND SIGNIFICANCE: To evaluate whether large language models (LLMs) can automate chart review to identify tumor necrosis factor inhibitor (TNFi) switching patterns and reasons for switching in a large real-world cohort.

MATERIALS AND METHODS: We conducted an observational study using de-identified electronic health record (EHR) data from 2012 to 2023 at a single academic medical center (University of California, San Francisco). TNFi medication orders and linked clinical notes were extracted, requiring at least 6 months of follow-up to identify treatment switches, defined as a change from one TNFi to another at consecutive encounters. Using GPT-4, we extracted which TNFi was stopped and which was started, and classified the reason for switching. Performance was benchmarked against eight open-source LLMs, structured EHR data, and expert annotation.

RESULTS: A total of 9187 patients (mean [SD] age, 39.9 [19.0] years; 57.1% female) received ≥1 TNFi with sufficient follow-up. We identified 3104 TNFi switches among 2112 patients. GPT-4 achieved micro-F1 scores of 0.75 (stopped drug), 0.80 (started drug), and 0.83 (reason). Among open-source models, Starling-7B-beta and Llama-3-8B performed most competitively. The most common reason identified by GPT-4 was lack of efficacy (56.9%), followed by adverse events (13.5%) and insurance/cost (10.8%).

CONCLUSIONS: Both GPT-4 and locally deployable LLMs effectively extracted complex treatment trajectories and rationale from clinical notes, supporting their broader utility in scalable EHR review and real-world evidence generation.
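The benchmark metric above, micro-averaged F1, pools true positives, false positives, and false negatives across all classes before computing precision and recall. A minimal sketch of how such a score could be computed for a single-label classification task like reason-for-switch extraction follows; the label names and example annotations are hypothetical, not drawn from the study data.

```python
def micro_f1(true_labels, pred_labels):
    """Micro-averaged F1: pool TP/FP/FN across all classes before scoring.

    For single-label multi-class tasks, each mismatch counts as one false
    positive (for the predicted class) and one false negative (for the
    true class), so micro-F1 reduces to overall accuracy.
    """
    tp = fp = fn = 0
    for t, p in zip(true_labels, pred_labels):
        if t == p:
            tp += 1
        else:
            fp += 1  # wrong prediction: false positive for class p
            fn += 1  # missed truth: false negative for class t
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical gold annotations vs. model output for reason-for-switch
gold = ["efficacy", "adverse_event", "insurance", "efficacy"]
pred = ["efficacy", "efficacy", "insurance", "efficacy"]
print(round(micro_f1(gold, pred), 2))  # → 0.75
```

In practice a library implementation (e.g. scikit-learn's `f1_score` with `average="micro"`) would typically be used; the sketch only makes the pooled-count definition explicit.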