From accuracy to robustness: a comparative study of five advanced large language models on the Chinese dental licensing examination



Abstract

BACKGROUND: Large language models (LLMs) have demonstrated considerable promise in various domains, including dentistry. This study aimed to evaluate five advanced LLMs (DeepSeek-R1, GPT-4o, OpenAI o3, GPT-5 Thinking, and Gemini 2.5 Pro) in the context of the Chinese Dental Licensing Examination (CDLE) to explore their potential in dental education and practice.

METHODS: A total of 600 questions were selected from the official review book provided by the Chinese National Medical Examination Center. All questions, presented in Chinese, were submitted individually to the five LLMs via their web interfaces. The responses were classified as "correct" or "incorrect" using the official answer keys provided by the review book. We analyzed and compared each model's overall accuracy and accuracy across different subjects and question types using χ² or Fisher's exact tests, as appropriate. To assess robustness, 120 of the 600 questions were selected for adversarial testing under two types of perturbations. We employed McNemar's test to measure each model's accuracy degradation during adversarial testing.

RESULTS: DeepSeek-R1, GPT-5 Thinking, Gemini 2.5 Pro, and OpenAI o3 demonstrated superior performance, significantly surpassing GPT-4o (p < 0.001), with Gemini 2.5 Pro achieving the highest accuracy at 91.67%. Performance varied across dentistry and its sub-disciplines (prosthodontics and oral anatomy), where GPT-4o significantly lagged behind the other four LLMs (p < 0.05). Gemini 2.5 Pro and GPT-5 Thinking outperformed GPT-4o on A1 and B1 question types (p < 0.05). In adversarial testing, all LLMs showed a slight decrease in accuracy, ranging from 1.66% to 5.84%, but the drop was not significant (p > 0.05).

CONCLUSIONS: Using the CDLE as a benchmark, new-generation LLMs achieved markedly higher accuracy. Furthermore, all models exhibited strong robustness against adversarial perturbations. These findings indicate that advanced LLMs hold promise as assistive tools for dental education and practice.
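
The paired comparison described in the methods (the same questions answered before and after perturbation) can be illustrated with a minimal sketch of McNemar's test. The abstract does not say which variant of the test was used, so the exact binomial form below is an assumption, and the function names `discordant_counts` and `mcnemar_exact` are hypothetical. The per-question outcomes are made-up illustrative data, not results from the study.

```python
from math import comb


def discordant_counts(before, after):
    """Count discordant pairs between two paired lists of correct/incorrect flags.

    Returns (b, c): b = correct before but incorrect after,
    c = incorrect before but correct after.
    """
    if len(before) != len(after):
        raise ValueError("paired outcome lists must have the same length")
    b = sum(1 for x, y in zip(before, after) if x and not y)
    c = sum(1 for x, y in zip(before, after) if not x and y)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    Under the null hypothesis, each discordant pair is equally likely to
    fall in either direction, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of any change
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical outcomes for one model on 120 questions (illustrative only):
# 101 correct both times, 7 correct then incorrect,
# 2 incorrect then correct, 10 incorrect both times.
before = [True] * 108 + [False] * 12
after = [True] * 101 + [False] * 7 + [True] * 2 + [False] * 10

b, c = discordant_counts(before, after)
p_value = mcnemar_exact(b, c)
print(f"discordant pairs: b={b}, c={c}, exact p={p_value:.4f}")
```

Only the questions whose outcome changed enter the test, which is why a drop of a few percentage points on 120 questions can remain non-significant: the number of discordant pairs is small.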
