Evaluating GPT-4's visual interpretation and clinical reasoning on emergency settings: A 5-year analysis

Abstract

BACKGROUND: The use of generative AI, particularly large language models such as GPT-4, is expanding in medical education. This study evaluated GPT-4's ability to interpret emergency medicine board exam questions, both text- and image-based, to assess its cognitive and decision-making performance in emergency settings.

METHODS: An observational study was conducted using Taiwan Emergency Medicine Board Exam questions (2018-2022). GPT-4's performance was assessed in terms of accuracy and reasoning across question types. Statistical analyses examined factors influencing performance, including knowledge dimension, cognitive level, clinical vignette presence, and question polarity.

RESULTS: GPT-4 achieved an overall accuracy of 60.1%, with similar results on text-based (60.2%) and image-based questions (59.3%). It showed perfect accuracy in identifying image types (100%) and high proficiency in interpreting findings (86.4%). However, accuracy declined in diagnostic reasoning (83.1%) and further dropped in final decision-making (59.3%). This stepwise decrease highlights GPT-4's difficulty integrating image analysis into clinical conclusions. No significant associations were found between question characteristics and AI performance.

CONCLUSION: GPT-4 demonstrates strong image recognition and moderate diagnostic reasoning but limited decision-making capabilities, especially when synthesizing visual and clinical data. Although promising as a training tool, its reliance on pattern recognition over clinical understanding restricts real-world applicability. Further refinement is needed before AI can reliably support emergency medical decisions.
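The stepwise decrease reported in the results can be made explicit with a little arithmetic. The sketch below is illustrative only and is not part of the study's analysis: it takes the four stage accuracies quoted in the abstract and computes the drop, in percentage points, between consecutive stages. The stage labels and the function name `stepwise_drops` are assumptions introduced here for readability.

```python
# Accuracy (percent) at each stage of image-based question answering,
# copied from the figures quoted in the abstract.
stage_accuracy = [
    ("image type identification", 100.0),
    ("interpretation of findings", 86.4),
    ("diagnostic reasoning", 83.1),
    ("final decision-making", 59.3),
]


def stepwise_drops(stages):
    """Return (from_stage, to_stage, drop) for each pair of consecutive stages.

    The drop is expressed in percentage points and rounded to one decimal.
    """
    return [
        (prev_name, name, round(prev_acc - acc, 1))
        for (prev_name, prev_acc), (name, acc) in zip(stages, stages[1:])
    ]


drops = stepwise_drops(stage_accuracy)
for from_stage, to_stage, drop in drops:
    print(f"{from_stage} -> {to_stage}: -{drop} points")

# The transition with the largest loss of accuracy.
largest = max(drops, key=lambda item: item[2])
```

On these figures the largest single drop (23.8 points) falls between diagnostic reasoning and final decision-making, which is consistent with the abstract's point that integrating image analysis into a clinical conclusion is the weakest step.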
