Abstract
BACKGROUND: Scaphoid fractures account for 60%-70% of wrist traumas, and delayed diagnosis can lead to avascular necrosis and functional impairment. Radiographic assessment remains challenging because of anatomical complexity and overlapping structures. This study evaluated three next-generation large language models (LLMs), ChatGPT-4o, Gemini 2.0, and Claude 3.5, for their ability to detect scaphoid fractures and determine surgical indications.
METHODS: A retrospective observational study was conducted at Ankara Etlik City Hospital (October 2022 to January 2025). It included 300 patients aged 18-65 years who presented to the emergency department (ED) with wrist trauma: 150 with computed tomography (CT)-confirmed scaphoid fractures and 150 without fractures. Three-view wrist radiographs were presented to each LLM on three separate days. Diagnostic accuracy was assessed using overall accuracy (all three responses correct), strict accuracy (≥2 correct responses), and ideal accuracy (≥1 correct response). Response consistency was evaluated using Fleiss' kappa coefficient. Surgical indications were determined based on fracture displacement criteria.
RESULTS: Claude 3.5 demonstrated higher sensitivity for fracture detection (57.1%) than Gemini 2.0 (18.2%) and ChatGPT-4o (9.1%) (p<0.001). Ideal accuracy rates were 79.3%, 36.0%, and 17.3%, respectively. Specificity remained uniformly low across models (43.1%-43.8%). All models performed better in non-fracture cases, with ideal accuracy exceeding 83%. Response consistency was moderate for all models (κ=0.36-0.41). For surgical indication assessment, Claude 3.5 identified 37.0% of cases requiring surgery, compared with ChatGPT-4o (34.1%) and Gemini 2.0 (24.4%), with correct determination rates of 73.7%, 71.4%, and 80.0%, respectively.
CONCLUSION: Current LLMs demonstrate insufficient diagnostic accuracy for independent clinical use in scaphoid fracture detection. Even the best-performing model, Claude 3.5, reached only 57.1% sensitivity, indicating that these technologies require substantial improvement before clinical deployment. However, their moderate performance in surgical decision-making suggests potential utility as assistive tools when combined with specialist expertise. Further development focusing on musculoskeletal-specific training is essential.
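The three accuracy tiers and the consistency measure described in the methods can be illustrated with a minimal Python sketch. It is not the study's analysis code (the abstract does not state which software was used). The function names `accuracy_tiers` and `fleiss_kappa` and the toy data are hypothetical, labels are assumed to be binary (1 = fracture, 0 = no fracture), and the three repeated responses per case are assumed to play the role of the three "raters" in Fleiss' kappa.

```python
from collections import Counter


def accuracy_tiers(responses, truth):
    """Accuracy tiers as defined in the abstract.

    responses: one list of three repeated model answers per case.
    truth: reference label for each case (here assumed CT-based, 1 or 0).
    """
    n = len(truth)
    # Number of correct answers (0 to 3) for each case.
    correct = [sum(r == t for r in reps) for reps, t in zip(responses, truth)]
    return {
        "overall": sum(c == 3 for c in correct) / n,  # all three correct
        "strict": sum(c >= 2 for c in correct) / n,   # at least two correct
        "ideal": sum(c >= 1 for c in correct) / n,    # at least one correct
    }


def fleiss_kappa(responses, categories=(0, 1)):
    """Fleiss' kappa, treating each repeated response as one rating."""
    n_cases = len(responses)
    n_ratings = len(responses[0])  # assumed equal for every case
    totals = Counter()
    agreement_sum = 0.0
    for reps in responses:
        counts = Counter(reps)
        totals.update(counts)
        # Per-case agreement: share of rating pairs that agree.
        agreement_sum += (
            sum(c * c for c in counts.values()) - n_ratings
        ) / (n_ratings * (n_ratings - 1))
    p_observed = agreement_sum / n_cases
    # Chance agreement from the overall category proportions.
    p_expected = sum(
        (totals[k] / (n_cases * n_ratings)) ** 2 for k in categories
    )
    if p_expected == 1.0:
        return float("nan")  # undefined when every rating is identical
    return (p_observed - p_expected) / (1 - p_expected)


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the study.
    toy_responses = [[1, 1, 1], [1, 0, 1], [0, 0, 1], [0, 0, 0]]
    toy_truth = [1, 1, 1, 0]
    print(accuracy_tiers(toy_responses, toy_truth))
    print(round(fleiss_kappa(toy_responses), 3))
```

Under these definitions the tiers are nested: a case counted toward overall accuracy also counts toward strict and ideal accuracy, so overall ≤ strict ≤ ideal for any model.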