Large language models underperform in European general surgery board examinations: a comparative study with experts and surgical residents



Abstract

BACKGROUND: Artificial intelligence (AI) has become a transformative tool in medical education and assessment. Despite advancements, AI models such as GPT-4o demonstrate variable performance on high-stakes examinations. This study compared the performance of four AI models (Llama-3, Gemini, GPT-4o, and Copilot) with specialists and residents on European General Surgery Board test questions, focusing on accuracy across question formats, lengths, and difficulty levels.

METHODS: A total of 120 multiple-choice questions were systematically sampled from the General Surgery Examination and Board Review question bank using a structured randomization protocol. The questions were administered via Google Forms to four large language models (Llama-3, GPT-4o, Gemini, and Copilot) and 30 surgeons (15 board-certified specialists and 15 residents) under timed, single-session conditions. Participant demographics (age, gender, years of experience) were recorded. Questions were categorized by word count (short, medium, long) and by difficulty level (easy, moderate, hard), rated by three independent board-certified surgeons. Group accuracy rates were compared using ANOVA with appropriate post-hoc tests, and 95% confidence intervals were reported.

RESULTS: Board-certified surgeons achieved the highest accuracy rate at 81.6% (95% CI: 78.9-84.3), followed by surgical residents at 69.9% (95% CI: 66.7-73.1). Among large language models (LLMs), Llama-3 demonstrated the best performance with an accuracy of 65.8% (95% CI: 62.4-69.2), whereas Copilot showed the lowest performance at 51.7% (95% CI: 48.1-55.3). LLM performance declined significantly as item difficulty and length increased, particularly for Copilot (68.3% on short vs. 36.4% on long questions, p < 0.001). In contrast, human participants maintained relatively stable accuracy across difficulty levels. Notably, only Llama-3 ranked within the human performance range, placing 26th among 30 surgeons, while all other LLMs failed to surpass the 60% accuracy threshold (p < 0.001).

CONCLUSION: Current LLMs underperform compared to human specialists when faced with questions requiring high-level medical knowledge, reinforcing their current role as supplementary tools in surgical education rather than replacements for expert clinical judgment.
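The accuracy figures above are proportions of correct answers out of 120 items, each reported with a 95% confidence interval. A minimal sketch of how such intervals can be computed is shown below using the normal-approximation (Wald) interval; the raw correct-answer counts are hypothetical reconstructions from the reported percentages (e.g. 79/120 ≈ 65.8% for Llama-3), not data from the study, and the study's own interval method is not specified in the abstract.

```python
import math

def accuracy_with_ci(correct, total, z=1.96):
    """Accuracy proportion with a normal-approximation (Wald) 95% CI.

    z = 1.96 corresponds to a two-sided 95% interval.
    """
    p = correct / total
    se = math.sqrt(p * (1 - p) / total)  # standard error of a proportion
    return p, max(0.0, p - z * se), min(1.0, p + z * se)

# Hypothetical counts out of 120 questions, back-calculated from the
# reported accuracies (81.6%, 69.9%, 65.8%) for illustration only.
for group, correct in [("specialists", 98), ("residents", 84), ("Llama-3", 79)]:
    p, lo, hi = accuracy_with_ci(correct, 120)
    print(f"{group}: {p:.1%} (95% CI {lo:.1%}-{hi:.1%})")
```

Note that the intervals this simple per-group formula yields are wider than those reported in the abstract, which likely reflect a pooled or model-based variance estimate from the ANOVA.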
