Abstract
Large language models (LLMs) have emerged as tools to support healthcare delivery, from automating tasks to aiding clinical decision-making. This study evaluated LLMs as an alternative to rule-based alert systems, focusing on their ability to identify prescribing errors. It was designed as a prospective, cross-over, open-label study involving 91 error scenarios based on 40 clinical vignettes across 16 medical and surgical specialties. We developed and validated five LLM-based models using a retrieval-augmented generation framework. The best-performing model was then evaluated under three implementation strategies: LLM-based clinical decision support system (CDSS) alone, pharmacist plus LLM-based CDSS (co-pilot), and pharmacist alone. The co-pilot arm demonstrated the best performance, with an accuracy of 61% (precision 0.57, recall 0.61, F1 0.59). In detecting errors posing serious harm, the co-pilot arm increased accuracy 1.5-fold over the pharmacist alone. Effective LLM integration for complex tasks such as medication chart review can enhance healthcare professional performance and improve patient safety.