Abstract
BACKGROUND: Artificial intelligence–based chatbots are increasingly used as supportive tools in healthcare for accessing and interpreting medical information. However, variations in language and temporal updates may influence the accuracy and consistency of the information they provide. This study aimed to evaluate the effects of language (Turkish versus English) and time (a 10-day period) on the accuracy of responses provided by six artificial intelligence (AI) chatbots (ChatGPT, ChatGPT-4o, ChatGPT-5, Gemini, Microsoft Copilot, and Perplexity) to questions related to vital pulp therapy.

METHODS: Twenty questions were prepared in accordance with the clinical guidelines of the American Association of Endodontists (AAE) and presented to each model in both Turkish and English three times a day (morning, noon, and evening) over a 10-day period, yielding a total of 7,200 responses. Responses were classified as correct or incorrect by two independent researchers. Accuracy rates were analyzed using descriptive statistics, and chi-square tests were used to compare the proportions of correct versus incorrect responses.

RESULTS: Microsoft Copilot achieved the highest overall accuracy rate (87.3%), while Perplexity showed the lowest (77.2%). By language, ChatGPT-4o performed best on Turkish questions (83.8%), whereas Microsoft Copilot demonstrated the highest accuracy on English questions (87.3%). Time had no statistically significant effect on accuracy.

CONCLUSIONS: These results indicate that AI chatbots can serve as supportive tools in clinical decision-making; however, language differences and model-dependent performance variations should be carefully considered.

SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12903-026-08060-9.