Abstract
BACKGROUND: Large language models (LLMs) have demonstrated considerable promise in various domains, including dentistry. This study evaluated five advanced LLMs (DeepSeek-R1, GPT-4o, OpenAI o3, GPT-5 Thinking, and Gemini 2.5 Pro) on the Chinese Dental Licensing Examination (CDLE) to explore their potential in dental education and practice.

METHODS: A total of 600 questions were selected from the official review book published by the Chinese National Medical Examination Center. All questions, presented in Chinese, were submitted individually to the five LLMs via their web interfaces. Responses were classified as "correct" or "incorrect" against the official answer keys provided in the review book. Each model's overall accuracy, as well as its accuracy across subjects and question types, was compared using χ² or Fisher's exact tests, as appropriate. To assess robustness, 120 of the 600 questions were selected for adversarial testing under two types of perturbations, and McNemar's test was used to measure each model's accuracy degradation.

RESULTS: DeepSeek-R1, GPT-5 Thinking, Gemini 2.5 Pro, and OpenAI o3 performed best, significantly surpassing GPT-4o (p < 0.001), with Gemini 2.5 Pro achieving the highest accuracy at 91.67%. Performance varied across dentistry sub-disciplines: in prosthodontics and oral anatomy, GPT-4o lagged significantly behind the other four LLMs (p < 0.05). Gemini 2.5 Pro and GPT-5 Thinking also outperformed GPT-4o on A1 and B1 question types (p < 0.05). In adversarial testing, all LLMs showed slight accuracy decreases of 1.66% to 5.84%, none of which was significant (p > 0.05).

CONCLUSIONS: Using the CDLE as a benchmark, new-generation LLMs achieved markedly higher accuracy than GPT-4o, and all models exhibited strong robustness against adversarial perturbations. These findings indicate that advanced LLMs hold promise as assistive tools for dental education and practice.
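McNemar's test, used here for the adversarial analysis, compares paired outcomes on the same questions: it tests whether the number of answers that flipped from correct to incorrect under perturbation differs from the number that flipped the other way. A minimal sketch of the exact (binomial) form in plain Python follows; the flip counts are hypothetical illustrations, not data from the study:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar's test p-value for paired binary outcomes.

    b: questions answered correctly originally but incorrectly under perturbation
    c: questions answered incorrectly originally but correctly under perturbation
    (Concordant pairs do not enter the statistic.)
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of any change
    k = min(b, c)
    # Under H0 each discordant pair flips either way with probability 0.5,
    # so the smaller count follows Binomial(n, 0.5); double the tail for two sides.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(p, 1.0)

# Hypothetical example: of 120 paired questions, 7 flipped correct -> incorrect
# and 3 flipped incorrect -> correct under perturbation.
print(round(mcnemar_exact(7, 3), 3))  # 0.344, i.e. p > 0.05
```

With such small discordant counts the exact binomial form is preferable to the chi-square approximation, which is consistent with the non-significant degradations (p > 0.05) reported above.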