Diagnostic capabilities of large language models in the detection of scaphoid fractures in the emergency department

Authors: Bensu Bulut, Mehmet Yortanlı, Ayşenur Gür, Medine Akkan Öz, Hüseyin Mutlu

Year: 2025
Citation: 2025 Oct;31(10):987-994
DOI: 10.14744/tjtes.2025.98680
Disease type: Fracture

Abstract

BACKGROUND: Scaphoid fractures account for 60%-70% of carpal bone fractures, and delayed diagnosis can lead to avascular necrosis and functional impairment. Radiographic assessment remains challenging because of the anatomical complexity of the wrist and its overlapping structures. This study evaluated three next-generation large language models (LLMs), ChatGPT-4o, Gemini 2.0, and Claude 3.5, for their ability to detect scaphoid fractures and determine surgical indications.

METHODS: A retrospective observational study was conducted at Ankara Etlik City Hospital (October 2022 to January 2025). It included 300 patients aged 18-65 years who presented to the emergency department (ED) with wrist trauma: 150 with computed tomography (CT)-confirmed scaphoid fractures and 150 without fractures. Three-view wrist radiographs were presented to each LLM on three separate days. Diagnostic accuracy was assessed using strict accuracy (all three responses correct), overall accuracy (≥2 correct responses), and ideal accuracy (≥1 correct response). Response consistency was evaluated using Fleiss' kappa coefficient. Surgical indications were determined based on fracture displacement criteria.

RESULTS: Claude 3.5 demonstrated superior sensitivity (57.1%) compared to Gemini 2.0 (18.2%) and ChatGPT-4o (9.1%) for fracture detection (p<0.001). Ideal accuracy rates were 79.3%, 36.0%, and 17.3%, respectively. Specificity remained uniformly low across models (43.1%-43.8%). All models performed better in non-fracture cases, with ideal accuracy exceeding 83%. Response consistency was moderate for all models (κ=0.36-0.41). For surgical indication assessment, Claude 3.5 identified 37.0% of cases as requiring surgery, compared to ChatGPT-4o (34.1%) and Gemini 2.0 (24.4%), with correct determination rates of 73.7%, 71.4%, and 80.0%, respectively.

CONCLUSION: Current LLMs demonstrate insufficient diagnostic accuracy for independent clinical use in scaphoid fracture detection. Claude 3.5's 57.1% sensitivity indicates that these technologies require substantial improvement before clinical deployment. However, their moderate performance in surgical decision-making suggests potential utility as assistive tools when combined with specialist expertise. Further development focusing on musculoskeletal-specific training is essential.
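The repeated-response metrics described in the methods can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the function names, the boolean encoding of "fracture present", and the toy data are all assumptions made for illustration. The three accuracy tiers are named by their thresholds (3 of 3, at least 2 of 3, at least 1 of 3 responses correct), and Fleiss' kappa treats the three daily responses of one model as three ratings of the same case.

```python
from collections import Counter


def tiered_accuracy(responses, truths):
    """Share of cases meeting each correctness threshold.

    responses: list of 3-tuples of bool (hypothetical encoding: True = fracture reported)
    truths:    list of bool (True = CT-confirmed fracture)
    """
    n = len(truths)
    # Number of the three repeated responses that match the reference, per case.
    correct = [sum(r == t for r in reps) for reps, t in zip(responses, truths)]
    return {
        "all_three_correct": sum(c == 3 for c in correct) / n,
        "at_least_two_correct": sum(c >= 2 for c in correct) / n,
        "at_least_one_correct": sum(c >= 1 for c in correct) / n,
    }


def fleiss_kappa(responses, categories=(True, False)):
    """Fleiss' kappa for a fixed number of ratings per case."""
    n_cases = len(responses)
    n_ratings = len(responses[0])
    # counts[i][j] = how many ratings of case i fell into category j.
    counts = [[Counter(reps)[c] for c in categories] for reps in responses]
    # Observed agreement within each case, then averaged over cases.
    p_i = [
        (sum(k * k for k in row) - n_ratings) / (n_ratings * (n_ratings - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_cases
    # Chance agreement from the overall category proportions.
    p_j = [
        sum(row[j] for row in counts) / (n_cases * n_ratings)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        # Every rating is in one category; kappa is undefined in that case.
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)


# Toy data (made up): four cases, three repeated responses each.
toy_responses = [
    (True, True, True),
    (True, True, False),
    (False, False, False),
    (True, False, False),
]
toy_truths = [True, True, False, True]

tiers = tiered_accuracy(toy_responses, toy_truths)
kappa = fleiss_kappa(toy_responses)
```

On the toy data, half of the cases have all three responses correct, three quarters have at least two correct, and every case has at least one correct, which shows how the most lenient tier can look far better than the strictest one for the same set of responses.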
