Performance of the ChatGPT-5 Language Model in Solving a Specialty Examination in Balneology and Physical Medicine



Abstract

Background

In recent years, there has been a breakthrough in the development of advanced computational systems based on neural networks. One such system is ChatGPT, built on the GPT architecture first introduced in 2018; its potential was quickly recognized, leading to global popularity. Language models are increasingly capable of addressing complex problems, making them a promising tool to support the training of medical professionals. A particularly important aspect is their ability to pass medical examinations, such as the Polish Medical Final Examination (LEK) and the National Specialty Examination (PES), as well as international exams, including the United States Medical Licensing Examination (USMLE) and various specialty board examinations.

Objective

The objective of this study is to analyze the potential of the latest publicly available version of the ChatGPT-5 model in answering PES examination questions in balneology and physical medicine. The study focuses on the accuracy of the model's answers and on the confidence it assigns to its decisions, in order to assess its potential use as a supportive tool in medical education and specialty exam preparation.

Materials and methods

The experiment was based on the official Spring 2024 PES in Balneology and Physical Medicine, which consisted of 120 questions. The correctness of ChatGPT-5's answers was verified against the official key prepared by the Center for Medical Examinations (CEM), and the model's self-declared confidence was recorded on a 1-5 scale. Both the answer key and the question database were obtained from the official CEM website. Before testing, ChatGPT-5 was briefed on the examination rules and given the full set of questions in Polish. The questions were divided into two groups, clinical and theoretical. Two questions were excluded due to inconsistency with current medical knowledge, leaving 118 scored questions. Statistical analyses, including the chi-square test and the Mann-Whitney U test, were performed using Microsoft Excel (Microsoft Corporation, Redmond, WA, USA) and GraphPad Prism (GraphPad Software, San Diego, CA, USA); a minimal code sketch of these tests follows the abstract.

Results

ChatGPT-5 answered 83 of the 118 scored questions correctly (70.34%), surpassing the passing threshold. No statistically significant difference in accuracy was observed between clinical and theoretical questions (p = 0.983), suggesting that any discrepancy is attributable to random variation rather than a true difference. Answer correctness was positively associated with the model's self-assessed confidence level (p = 0.029): the higher the declared confidence, the greater the likelihood of a correct response. The Mann-Whitney U test (p = 0.07) indicated that the difference in confidence between clinical and theoretical questions did not reach statistical significance (α = 0.05), although a trend toward a difference was observed.

Conclusions

ChatGPT-5 performed well enough to pass the specialty examination in Balneology and Physical Medicine. The model declared lower confidence on advanced clinical questions than on theoretical ones, and answer accuracy correlated with the declared confidence level. Although the Mann-Whitney U test (p = 0.07) did not confirm a statistically significant difference in confidence between the two question categories, it suggested a possible trend. Further expert evaluation is required before such models can be widely implemented in medical education.
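The abstract names the statistical tests but not the computation behind them. The sketch below is an editorial illustration, not the authors' code: it uses Python's scipy.stats in place of the Excel and GraphPad Prism workflow described above, and the per-question records (the clinical/theoretical split, correctness flags, and confidence scores) are hypothetical placeholders; only the aggregate figures (118 scored questions, 83 correct) come from the abstract.

```python
# Illustrative sketch of the reported statistical comparisons.
# HYPOTHETICAL per-question data: only the aggregates (118 scored
# questions, 83 correct) are taken from the abstract.
import numpy as np
from scipy.stats import chi2_contingency, mannwhitneyu

rng = np.random.default_rng(42)

# Assumed even split of the 118 scored questions (the abstract does not give it).
n_clinical, n_theoretical = 59, 59

# Simulated correctness flags consistent with ~70% overall accuracy.
correct_clin = rng.random(n_clinical) < 0.70
correct_theo = rng.random(n_theoretical) < 0.70

# Chi-square test: does accuracy differ between clinical and theoretical questions?
table = np.array([
    [correct_clin.sum(), (~correct_clin).sum()],
    [correct_theo.sum(), (~correct_theo).sum()],
])
chi2, p_chi2, dof, expected = chi2_contingency(table)
print(f"chi-square p = {p_chi2:.3f}   (abstract: p = 0.983)")

# Simulated self-declared confidence scores on the 1-5 scale.
conf_clin = rng.integers(1, 6, size=n_clinical)
conf_theo = rng.integers(1, 6, size=n_theoretical)

# Mann-Whitney U test: do confidence levels differ between the two categories?
u_stat, p_mwu = mannwhitneyu(conf_clin, conf_theo, alternative="two-sided")
print(f"Mann-Whitney U p = {p_mwu:.3f}   (abstract: p = 0.07)")

# Sanity check of the reported overall accuracy.
print(f"overall accuracy = {83 / 118:.2%}   (abstract: 70.34%)")
```

With the study's real per-question records (category, correctness, confidence), the same two calls would reproduce the comparisons reported in the Results; the simulated inputs here will of course yield different p-values.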
