Benchmark evaluation of large language models for clinical decision support in headache management

Abstract

BACKGROUND: Headache disorders are a major cause of disability worldwide. In routine practice, diagnosis and guideline-based management are difficult because symptoms can overlap between primary and secondary headaches, and clinicians must combine clinical, imaging, and pathological information. Large language models (LLMs) are being proposed to assist clinical reasoning, but their performance on headache cases and their sensitivity to prompting have not been systematically assessed.

METHODS: We evaluated seven leading LLMs using 13 headache cases from the New England Journal of Medicine (NEJM). We compared two prompting strategies: ask-in-sequence (AS) and ask-at-once (AO). Using a 5-point Likert rubric, three headache specialists independently scored six dimensions: rationality of diagnostic thinking, comprehensiveness of differential diagnosis, diagnostic accuracy, completeness of pathological diagnosis, clinical management, and supplementary value. Readability was measured with Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL). We analyzed differences across models, prompting strategies, and cases.

RESULTS: Diagnostic accuracy differed by model: in the AS strategy, ChatGPT-4o outperformed Grok-3. Supplementary value also varied: in AS, Grok-3 outperformed ChatGPT-5 and Hunyuan-T1; in AO, DeepSeek-R1 outperformed ChatGPT-5. Overall, supplementary value was generally higher with AS, while strategy-related differences in diagnostic accuracy were observed only for Grok-3. Performance also depended on the case; C8 and C11 consistently received very low scores, suggesting difficulty integrating psychiatric or warning signs with pathological findings. Readability differed significantly: Gemini 2.5 Pro had the highest FRE (best readability) across strategies, and AS outputs generally had higher FRE. Within AS, ChatGPT-4o had the highest FKGL (worst readability). No significant model differences were found for the other four clinical dimensions.

CONCLUSIONS: This study provides a structured, reproducible evaluation of LLMs on headache case analysis. While some models improved supplementary value, diagnostic accuracy, or readability, overall clinical accuracy remains below expert performance and is not sufficient for unsupervised clinical use.
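The two prompting strategies differ only in how the same questions reach the model. The sketch below illustrates the contrast under stated assumptions: `chat` is a hypothetical stand-in for any chat-completion API, and the question wording is illustrative, since the study's exact prompts are not given in the abstract.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]
ChatFn = Callable[[List[Message]], str]  # hypothetical chat-completion function

# Illustrative questions mirroring the scored dimensions; the study's
# actual prompt wording is not reported in the abstract.
QUESTIONS = [
    "Explain your diagnostic reasoning for this case.",
    "List the relevant differential diagnoses.",
    "State the most likely diagnosis.",
    "Interpret the pathological findings.",
    "Outline guideline-based clinical management.",
]

def ask_at_once(chat: ChatFn, case_text: str) -> str:
    """AO: the case and all questions go to the model in a single prompt."""
    prompt = case_text + "\n\n" + "\n".join(QUESTIONS)
    return chat([{"role": "user", "content": prompt}])

def ask_in_sequence(chat: ChatFn, case_text: str) -> List[str]:
    """AS: questions are posed one at a time, with earlier turns kept in context."""
    history: List[Message] = []
    answers: List[str] = []
    for i, question in enumerate(QUESTIONS):
        content = f"{case_text}\n\n{question}" if i == 0 else question
        history.append({"role": "user", "content": content})
        reply = chat(history)
        history.append({"role": "assistant", "content": reply})
        answers.append(reply)
    return answers
```

The design difference matters for evaluation: in AS the model can build on its own earlier answers, which may relate to the higher supplementary-value scores reported above, while AO forces it to organize the full analysis in one pass.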
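FRE and FKGL are closed-form functions of two text statistics: average sentence length and average syllables per word. A minimal sketch with the standard published coefficients follows; the syllable counter is a rough vowel-group heuristic rather than the dictionary-backed counter a readability tool would use, so the resulting scores are approximate.

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count runs of consecutive vowels, then subtract a
    # common silent trailing "e". Dictionary-based counters are more accurate.
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def flesch_scores(text: str) -> tuple[float, float]:
    """Return (FRE, FKGL) using the standard published coefficients."""
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    total_words = max(len(words), 1)
    total_syllables = sum(count_syllables(w) for w in words)
    asl = total_words / sentences        # average sentence length
    asw = total_syllables / total_words  # average syllables per word
    fre = 206.835 - 1.015 * asl - 84.6 * asw  # Flesch Reading Ease
    fkgl = 0.39 * asl + 11.8 * asw - 15.59    # Flesch-Kincaid Grade Level
    return fre, fkgl

print(flesch_scores("Migraine is a common primary headache. It is often underdiagnosed."))
```

Higher FRE means easier text, while higher FKGL means a higher U.S. school grade is required, which is why the abstract treats the highest FRE as the best readability and the highest FKGL as the worst.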
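The abstract reports pairwise model differences but does not name its statistical tests. For ordinal Likert scores, a common non-parametric choice is a Kruskal-Wallis omnibus test followed by pairwise Mann-Whitney U comparisons; the sketch below assumes that approach and uses hypothetical score arrays, since the raw ratings are not published in the abstract.

```python
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

# Hypothetical data: per-model 1-5 Likert ratings for one dimension,
# pooled over cases and raters. Real values are not in the abstract.
scores = {
    "ChatGPT-4o":  [5, 4, 4, 5, 3, 4],
    "Grok-3":      [3, 3, 4, 2, 3, 3],
    "DeepSeek-R1": [4, 4, 5, 4, 4, 3],
}

# Omnibus test: do the models' score distributions differ at all?
h, p = kruskal(*scores.values())
print(f"Kruskal-Wallis H={h:.2f}, p={p:.3f}")

# Pairwise follow-ups (a multiple-comparison correction, e.g. Bonferroni,
# would be needed in a real analysis).
for (name_a, a), (name_b, b) in combinations(scores.items(), 2):
    u, p_pair = mannwhitneyu(a, b, alternative="two-sided")
    print(f"{name_a} vs {name_b}: U={u:.1f}, p={p_pair:.3f}")
```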
