Abstract
OBJECTIVE: Large language models (LLMs) have advanced rapidly, but their utility in pediatric surgery remains uncertain. This study assessed the performance of three AI models, namely DeepSeek, Microsoft Copilot (GPT-4), and Google Bard, on the European Pediatric Surgery In-Training Examination (EPSITE).

METHODS: We evaluated model performance on 294 EPSITE questions from 2021 to 2023. Data for Copilot and Bard were collected in early 2024, while DeepSeek was assessed in 2025. Model responses were compared with those of pediatric surgical trainees, and differences in performance were tested statistically.

RESULTS: DeepSeek achieved the highest accuracy (85.0%), followed by Copilot (55.4%) and Bard (48.0%). Pediatric surgical trainees averaged 60.1%. Performance differences were statistically significant (p < 0.0001). DeepSeek significantly outperformed both the human trainees and the other models (p < 0.0001), while Bard was outperformed by trainees at every training level (p < 0.01). Sixth-year trainees performed better than Copilot (p < 0.05). Copilot and Bard failed to answer a small portion of questions (3.4% and 4.7%, respectively), citing ethical concerns or a perceived lack of a correct choice. Because DeepSeek was assessed later than Copilot and Bard, its superior performance likely reflects, at least in part, the rapid evolution of LLMs over that interval.

CONCLUSION: LLMs show variable performance in pediatric surgery, with newer models such as DeepSeek demonstrating marked improvement. These findings highlight the rapid progression of LLM capabilities and emphasize the need for ongoing evaluation before clinical integration, especially in high-stakes decision-making contexts.
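
The abstract does not name the statistical tests used, so the following is only an illustrative sketch of how a reader might sanity-check the reported model-versus-model differences. It assumes that the number of correct answers can be approximated by rounding each reported accuracy against the 294 questions, and it uses a pooled two-proportion z-test, which is not necessarily the authors' method. The function names `approx_correct` and `two_proportion_z_test` are hypothetical helpers introduced here.

```python
import math

N_QUESTIONS = 294  # number of EPSITE questions, from the abstract

# Accuracies (percent) as reported in the abstract.
reported_accuracy = {"DeepSeek": 85.0, "Copilot": 55.4, "Bard": 48.0}


def approx_correct(pct, n=N_QUESTIONS):
    """ASSUMPTION: approximate the count of correct answers by rounding pct * n."""
    return round(pct / 100 * n)


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided pooled two-proportion z-test; returns (z, p_value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Reconstructed (approximate) counts of correct answers per model.
counts = {name: approx_correct(pct) for name, pct in reported_accuracy.items()}

for a, b in [("DeepSeek", "Copilot"), ("DeepSeek", "Bard"), ("Copilot", "Bard")]:
    z, p = two_proportion_z_test(counts[a], N_QUESTIONS, counts[b], N_QUESTIONS)
    print(f"{a} vs {b}: z = {z:.2f}, p = {p:.3g}")
```

Under these assumptions, the DeepSeek comparisons yield p-values far below 0.0001, in line with the abstract. Because all models answered the same questions, a paired test such as McNemar's would be more appropriate if per-question results were available; the unpaired test here is a simplification.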