Abstract
INTRODUCTION: This study hypothesized that large language models (LLMs) would underperform compared with expert clinicians in diagnosing and managing complex endodontic anomalies, such as dens invaginatus, when provided with periapical radiographs. Although LLMs have shown promise in dental education and basic diagnostics, their effectiveness in nuanced clinical reasoning has remained unclear.

METHODS: Nineteen anonymized periapical radiographs depicting challenging endodontic conditions were paired with clinical vignettes. Six advanced LLMs and one expert endodontist independently answered six structured clinical questions per case. Each response was scored against a reference key. Accuracy rates were compared using Kruskal-Wallis and Mann-Whitney U tests, and chi-square tests were used to evaluate model performance across question types.

RESULTS: The expert achieved 100% accuracy, while all LLMs performed significantly worse (P < 0.05). Copilot demonstrated the lowest scores across all questions. The most substantial performance drop was observed in anomaly classification tasks, particularly in identifying and categorizing dens invaginatus. No significant performance differences were found among the top-performing LLMs.

CONCLUSIONS: While LLMs showed competence in basic diagnostic tasks, they failed to replicate expert-level decision-making in complex endodontic scenarios. Their current capabilities remain insufficient for unsupervised clinical use. This study is among the first to assess LLMs using real radiographic data in endodontics and highlights the need for further multimodal model development.

SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12903-025-06987-z.
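The statistical workflow described in METHODS (nonparametric group comparison followed by pairwise and categorical tests) could be sketched as follows. This is a minimal illustration with entirely synthetic scores, not the study's data; the group names and counts are hypothetical, and it assumes SciPy is available.

```python
# Sketch of the analysis pipeline: Kruskal-Wallis across groups,
# Mann-Whitney U for a pairwise comparison, chi-square across
# question types. All numbers below are synthetic placeholders.
from scipy.stats import kruskal, mannwhitneyu, chi2_contingency

# Hypothetical per-case scores (number of correct answers out of 6)
# for an expert and two LLMs.
expert = [6, 6, 6, 6, 6]
llm_a = [5, 4, 6, 3, 5]
llm_b = [3, 2, 4, 3, 2]

# Kruskal-Wallis: do the responders differ overall?
h_stat, p_kw = kruskal(expert, llm_a, llm_b)

# Mann-Whitney U: pairwise comparison of two responders.
u_stat, p_mw = mannwhitneyu(expert, llm_b, alternative="two-sided")

# Chi-square: correct/incorrect counts across question types.
# Rows = question types, columns = (correct, incorrect).
table = [[18, 1], [12, 7], [8, 11]]
chi2, p_chi, dof, _ = chi2_contingency(table)

print(f"Kruskal-Wallis p={p_kw:.4f}, Mann-Whitney p={p_mw:.4f}, "
      f"chi-square p={p_chi:.4f} (dof={dof})")
```

A real analysis would also need a multiple-comparison correction for the pairwise tests; the abstract does not state whether one was applied.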