Abstract
Physicians face a significant documentation burden, spending twice as much time on electronic health records (EHRs) as on direct patient care. Consultation summary reports from the emergency department (ED) are critical for continuity of care and clinical decision-making. This study evaluates the quality and utility of automatically generated neurological consultation reports that include clear recommendations, with the goal of reducing neurologists' documentation burden. We used neurological consultation reports (n = 250) from the ED as reference outputs. For each case, we fed the report's constituent components into a large language model (LLM) and applied prompt engineering and retrieval-augmented generation (RAG) to generate auto-summarized reports, which were then compared against the original consultation reports. Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores and semantic embedding similarity (Clinical-BioBERT) were used as performance metrics. The LLM-generated reports exhibited high semantic similarity to the neurologists' reports (0.89 ± 0.03). However, report lengths differed significantly: LLM-generated reports were more concise than those written by attending neurologists (61.56 vs. 94.75 words, p < 0.001). LLM-generated reports were also written in a more straightforward and accessible style (Flesch-Kincaid Grade Level, FKGL = 11.3 vs. 12.22, p < 0.001). Despite these strengths, the LLM-generated reports diverged substantially from the neurologists' reports in writing style (ROUGE-1 F1 = 0.25, ROUGE-2 F1 = 0.09, ROUGE-L F1 = 0.19). In summary, LLM-generated neurological consultation reports demonstrate strong semantic alignment with human-authored reports while offering a more concise and accessible format. The notable differences in writing style suggest a standardized approach that, while effective in conveying clinical content, may lack the personalization of neurologist-written reports.
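To make the evaluation pipeline concrete, the sketch below shows one plausible way to compute the metrics named in the abstract: ROUGE-1/2/L F1, Clinical-BioBERT semantic similarity, FKGL, and word count. It is a minimal illustration, not the authors' implementation; the Hugging Face checkpoint `emilyalsentzer/Bio_ClinicalBERT`, mean pooling over token embeddings, and the `rouge-score` and `textstat` packages are all assumptions on our part, as the abstract does not specify the tooling.

```python
# Hypothetical evaluation sketch; assumes: pip install torch transformers rouge-score textstat
import torch
from transformers import AutoModel, AutoTokenizer
from rouge_score import rouge_scorer
import textstat

# Assumed Clinical-BioBERT checkpoint; the paper does not name one.
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")


def embed(text: str) -> torch.Tensor:
    """Mean-pooled Clinical-BioBERT embedding (one plausible pooling choice)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)   # zero out padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)


def compare_reports(reference: str, generated: str) -> dict:
    """Score an LLM-generated report against the neurologist's reference report."""
    # Lexical overlap: ROUGE-1/2/L F1 between reference and generated text.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    overlap = scorer.score(reference, generated)
    # Semantic similarity: cosine between the two pooled embeddings.
    similarity = torch.cosine_similarity(embed(reference), embed(generated)).item()
    return {
        "semantic_similarity": similarity,
        "rouge1_f1": overlap["rouge1"].fmeasure,
        "rouge2_f1": overlap["rouge2"].fmeasure,
        "rougeL_f1": overlap["rougeL"].fmeasure,
        "fkgl_generated": textstat.flesch_kincaid_grade(generated),
        "word_count_generated": len(generated.split()),
    }
```

Aggregating `compare_reports` over the 250 case pairs and averaging each field would yield study-level figures analogous to those reported above (e.g., mean semantic similarity, mean ROUGE F1, mean FKGL).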