Abstract
BACKGROUND: Headache disorders are a major cause of disability worldwide. In routine practice, diagnosis and guideline-based management are difficult because symptoms can overlap between primary and secondary headaches, and clinicians must integrate clinical, imaging, and pathological information. Large language models (LLMs) have been proposed to assist clinical reasoning, but their performance on headache cases and their sensitivity to prompting have not been systematically assessed.
METHODS: We evaluated seven leading LLMs on 13 headache cases from the New England Journal of Medicine (NEJM). We compared two prompting strategies: ask-in-sequence (AS) and ask-at-once (AO). Using a 5-point Likert rubric, three headache specialists independently scored six dimensions: rationality of diagnostic thinking, comprehensiveness of differential diagnosis, diagnostic accuracy, completeness of pathological diagnosis, clinical management, and supplementary value. Readability was measured with the Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) metrics. We analyzed differences across models, prompting strategies, and cases.
RESULTS: Diagnostic accuracy differed by model: under the AS strategy, ChatGPT-4o outperformed Grok-3. Supplementary value also varied: under AS, Grok-3 outperformed ChatGPT-5 and Hunyuan-T1; under AO, DeepSeek-R1 outperformed ChatGPT-5. Overall, supplementary value was generally higher with AS, whereas strategy-related differences in diagnostic accuracy were observed only for Grok-3. Performance also varied by case: cases C8 and C11 consistently received very low scores, suggesting that models struggle to integrate psychiatric features or warning signs with pathological findings. Readability differed significantly: Gemini 2.5 Pro had the highest FRE (best readability) under both strategies, and AS outputs generally had higher FRE. Within AS, ChatGPT-4o had the highest FKGL (lowest readability). No significant model differences were found for the other four clinical dimensions.
CONCLUSIONS: This study provides a structured, reproducible evaluation of LLMs on headache case analysis. Although some models performed better on supplementary value, diagnostic accuracy, or readability, overall clinical accuracy remains below expert performance and is insufficient for unsupervised clinical use.