Comparative Evaluation of Responses from ChatGPT-5, Gemini 2.5 Flash, Grok 4, and Claude Sonnet-4 Chatbots to Questions About Endodontic Iatrogenic Events


Abstract

Background: The aim of this study was to compare four recently introduced large language models (LLMs): ChatGPT-5, Grok 4, Gemini 2.5 Flash, and Claude Sonnet-4. Experienced endodontists evaluated the accuracy, completeness, and readability of the models' responses to open-ended questions about iatrogenic events in endodontics. Methods: Twenty-five open-ended questions related to iatrogenic events in endodontics were prepared. The responses of the four LLMs were evaluated by two specialist endodontists using a Likert scale for accuracy and completeness, and the Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index (GFI), Simplified Measure of Gobbledygook (SMOG), and Coleman-Liau Index (CLI) for readability. Results: The accuracy score of ChatGPT-5's responses to the open-ended questions (4.56 ± 0.65) was significantly higher than those of Gemini 2.5 Flash (3.64 ± 0.95) and Claude Sonnet-4 (3.44 ± 1.19) (p = 0.009 and p = 0.002, respectively). Similarly, the completeness score of ChatGPT-5 (2.88 ± 0.33) was higher than those of Claude Sonnet-4, Gemini 2.5 Flash, and Grok 4 (p < 0.001, p = 0.002, and p = 0.007, respectively). In terms of readability, ChatGPT-5 and Gemini 2.5 Flash achieved better FRES values than Claude Sonnet-4 (p = 0.003 and p < 0.001, respectively). Conversely, FKGL scores were higher for Claude Sonnet-4 and Grok 4 than for ChatGPT-5 (p < 0.001 and p = 0.008, respectively). Correlation analyses revealed a strong positive association between accuracy and completeness (r(s) = 0.77; p < 0.001), a weak negative correlation between completeness and FKGL (r(s) = -0.19; p = 0.047), and a strong negative correlation between FKGL and FRES (r(s) = -0.88; p < 0.001). Additionally, ChatGPT-5 demonstrated lower GFI and CLI scores than the other models, and its SMOG scores were lower than those of Gemini 2.5 Flash and Grok 4 (p = 0.001 and p < 0.001, respectively). Conclusions: Although differences were observed between the LLMs in the accuracy and completeness of their responses, ChatGPT-5 showed the best performance. Even when responses score highly for accuracy (excellent) and completeness (comprehensive), it must not be forgotten that incorrect information can lead to serious consequences in healthcare. The readability of responses is therefore of critical importance, and when selecting a model, readability should be evaluated together with content quality.
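The readability indices and Spearman correlations reported above are standard, formula-based measures. The short Python sketch below shows how such scores might be reproduced for a single chatbot response; it is only an illustration, not the study's analysis pipeline. It assumes the third-party textstat package and SciPy (the abstract does not state which software was used), and the sample response text and rater scores are purely hypothetical.

import textstat
from scipy.stats import spearmanr

# Hypothetical chatbot response to one open-ended question on an iatrogenic event.
response = (
    "Instrument separation during root canal preparation can often be managed "
    "by bypassing or retrieving the fragment under magnification."
)

# The five readability indices used in the study, computed with textstat.
readability = {
    "FRES": textstat.flesch_reading_ease(response),   # higher = easier to read
    "FKGL": textstat.flesch_kincaid_grade(response),  # US school grade level
    "GFI":  textstat.gunning_fog(response),
    "SMOG": textstat.smog_index(response),            # needs longer texts for a stable value
    "CLI":  textstat.coleman_liau_index(response),
}
print(readability)

# Hypothetical rater scores for 25 questions (accuracy on a 1-5 scale,
# completeness on a 1-3 scale), used to illustrate the Spearman correlation.
accuracy     = [5, 4, 5, 3, 4, 5, 5, 4, 3, 5, 4, 5, 5, 4, 4, 5, 3, 5, 4, 5, 5, 4, 5, 3, 5]
completeness = [3, 2, 3, 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3, 2, 3, 2, 3, 3, 3, 3, 2, 3, 2, 3]

rho, p = spearmanr(accuracy, completeness)
print(f"Spearman r(s) = {rho:.2f}, p = {p:.3f}")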
