Evaluating the Performance of ChatGPT4.0 Versus ChatGPT3.5 on the Hand Surgery Self-Assessment Exam: A Comparative Analysis of Performance on Image-Based Questions



Abstract

Hypothesis

ChatGPT, an artificial intelligence (AI) platform, has become an increasingly useful tool in medical education, particularly in resident education as a supplement to certification exam preparation. As the AI model inevitably progresses, there is a growing need to establish ChatGPT's accuracy in specialty knowledge. Our study assesses the performance of ChatGPT4.0 on self-assessment questions pertaining to hand surgery compared with that of its predecessor, ChatGPT3.5. A distinguishing feature of ChatGPT4.0 is its ability to interpret visual input, which ChatGPT3.5 cannot. We hypothesized that ChatGPT4.0 would perform better on image-based questions than ChatGPT3.5.

Methods

This study used 10 self-assessment exams (2004–2013) from the American Society for Surgery of the Hand (ASSH). Performance on image-based questions was compared between ChatGPT4.0 and ChatGPT3.5. The primary outcome was the total score, measured as the proportion of answers correct. Secondary outcomes were the proportion of questions for which ChatGPT4.0 provided elaborations, the length of those elaborations, and the number of questions for which ChatGPT4.0 answered with confidence. Descriptive statistics, Student's t-test, and one-way ANOVA were used for data analysis.

Results

Across 455 image-based questions, there was no statistically significant difference in total score between ChatGPT4.0 and ChatGPT3.5: ChatGPT4.0 answered 137 (30.1%) questions correctly, while ChatGPT3.5 answered 131 (28.7%) correctly (p = 0.805). Although neither the length nor the frequency of elaborations differed significantly with respect to the proportion of correct answers between the two versions, ChatGPT4.0 did provide significantly longer explanations overall than ChatGPT3.5 (p < 0.05). Moreover, of the 455 image-based questions, ChatGPT4.0 gave significantly fewer confident answers than ChatGPT3.5 (p < 0.05). Among the responses in which ChatGPT4.0 expressed uncertainty, there was a significant difference by image type, with the greatest uncertainty arising from question stems involving radiograph-based images (p < 0.001).

Summary points

Overall, there was no significant difference in performance between ChatGPT4.0 and ChatGPT3.5 when answering image-based questions on the ASSH self-assessment examinations. Notably, however, ChatGPT4.0 expressed more uncertainty in its answers. Further exploration of how AI-generated responses influence user behavior in clinical and educational settings will be crucial to optimizing the role of AI in healthcare.
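To illustrate why the headline accuracy gap (137/455 vs. 131/455 correct) is not statistically significant, here is a minimal sketch of a two-proportion z-test on those counts. This is a hypothetical re-creation, not the authors' analysis: the paper reports p = 0.805 from its own tests (Student's t-test, one-way ANOVA), whereas this sketch uses a standard pooled z-test purely to show the scale of the difference.

```python
# Hedged illustration (not the authors' code): two-proportion z-test on the
# counts reported in the abstract -- 137/455 correct for ChatGPT4.0 vs
# 131/455 correct for ChatGPT3.5.
import math


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for H0: the two proportions are equal."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)          # pooled proportion under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF via math.erf
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value


z, p = two_proportion_z_test(137, 455, 131, 455)
print(f"ChatGPT4.0: {137/455:.1%} correct, ChatGPT3.5: {131/455:.1%} correct")
print(f"z = {z:.2f}, p = {p:.3f}")  # p well above 0.05: no significant difference
```

A 1.3-percentage-point gap over 455 questions is far smaller than the sampling noise at this sample size, which is consistent with the non-significant result the paper reports.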
