Evaluating ChatGPT's Performance in Classifying Pertrochanteric Fractures Based on Arbeitsgemeinschaft für Osteosynthesefragen/Orthopedic Trauma Association (AO/OTA) Standards

Abstract

Introduction: Chat Generative Pre-trained Transformer (ChatGPT) has become widely recognized for its ability to generate text, synthesize complex information, and perform a variety of tasks without requiring human specialists for data collection. The latest iteration, ChatGPT-4, is a large multimodal model capable of integrating both text and image inputs, which makes it particularly promising for medical applications. However, its efficacy in analyzing radiographic images remains largely unexplored.

Aim: This study aims to (i) address the lack of data on the accuracy of ChatGPT in classifying fractures on radiographs as stable or unstable under the revised Arbeitsgemeinschaft für Osteosynthesefragen/Orthopedic Trauma Association (AO/OTA) classification system, a task that is also performed by surgeons, and (ii) compare the agreement achieved by surgeons with that achieved by ChatGPT. The study hypothesizes that ChatGPT would achieve moderate agreement with orthopedic surgeons.

Materials and methods: Patients diagnosed with pertrochanteric fractures were retrospectively collected. Patients with both preoperative two-directional plain radiographs and three-dimensional CT (3D-CT) images were eligible for enrollment. Two orthopedic surgeons (observers 1 and 2) and one resident (observer 3) each classified every fracture once as A1 (stable) or A2 (unstable) according to the AO/OTA classification, using the two-directional plain radiographs. Before the ChatGPT evaluation, all anteroposterior images were cropped to the fractured side and labeled with figure names that included gender and age; they were then uploaded to OpenAI ChatGPT-4. Radiological evaluation prompts were designed to initiate ChatGPT's classification analysis of the uploaded radiographic images. A single observer (MN) determined the reference classification by examining the 3D-CT images together with the plain radiographs. This classification into A1 (stable) or A2 (unstable) served as the benchmark against which the radiograph-based results of the observers and ChatGPT were scored.

Results: After exclusions, the cohort comprised 29 males and 90 females, with a mean age of 87 years. The fractures were classified into A1 (stable) and A2 (unstable) groups based on CT imaging. The A1 group included 50 patients (13 males, 37 females; mean age: 86.2 ± 7.8 years), while the A2 group included 69 patients (16 males, 53 females; mean age: 87.0 ± 7.9 years). Kappa values comparing the plain-radiograph classifications of the three observers and ChatGPT with the CT-based gold standard showed fair to moderate agreement: Observer 1: 0.494 (95% CI: 0.337-0.650), Observer 2: 0.390 (95% CI: 0.227-0.553), Observer 3: 0.360 (95% CI: 0.198-0.521), and ChatGPT: 0.420 (95% CI: 0.255-0.585). ChatGPT demonstrated accuracy, sensitivity, specificity, and positive and negative predictive values comparable to those of the human observers, suggesting moderate reliability.

Conclusion: This study demonstrates that ChatGPT can classify pertrochanteric fractures into A1 (stable) and A2 (unstable) under the revised AO/OTA classification system. Its moderate agreement with CT-based assessments (κ = 0.420) is comparable to the performance of orthopedic surgeons. Moreover, ChatGPT is straightforward to integrate into clinical workflows, requiring minimal data collection for training.
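The agreement statistics reported in the Results (Cohen's kappa with a 95% CI, plus accuracy, sensitivity, specificity, and predictive values) can all be derived from a 2x2 table of rater versus CT-based reference. The following is a minimal sketch of that arithmetic, not the authors' analysis code. The abstract does not report the underlying confusion matrices or the method used for the confidence intervals, so the counts below are purely hypothetical, the function names are illustrative, the interval uses the simple large-sample standard error, and treating A2 (unstable) as the positive class is an assumption.

```python
import math


def cohen_kappa_2x2(tp, fp, fn, tn):
    """Cohen's kappa and an approximate 95% CI for a 2x2 agreement table.

    tp: rater A2, reference A2    fp: rater A2, reference A1
    fn: rater A1, reference A2    tn: rater A1, reference A1
    (A2 = unstable is treated as "positive"; this is an assumption.)
    """
    n = tp + fp + fn + tn
    p_observed = (tp + tn) / n
    # Chance agreement from the row and column marginals.
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (p_observed - p_expected) / (1 - p_expected)
    # Simple large-sample standard error (an approximation).
    se = math.sqrt(p_observed * (1 - p_observed) / (n * (1 - p_expected) ** 2))
    return kappa, (kappa - 1.96 * se, kappa + 1.96 * se)


def diagnostic_metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity, specificity, PPV, and NPV from the same table."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Hypothetical counts for illustration only; NOT data from the study.
counts = dict(tp=40, fp=10, fn=20, tn=30)

kappa, (ci_low, ci_high) = cohen_kappa_2x2(**counts)
print(f"kappa = {kappa:.3f} (95% CI: {ci_low:.3f}-{ci_high:.3f})")
for name, value in diagnostic_metrics(**counts).items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts the observed agreement is 0.70 and the chance agreement is 0.50, giving a kappa of about 0.40. Under the commonly used Landis and Koch bands, values of 0.21-0.40 are read as fair and 0.41-0.60 as moderate, which matches the "fair to moderate" wording applied to the reported values.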
