Performance of DeepSeek and GPT Models on Pediatric Board Preparation Questions: Comparative Evaluation


Abstract

BACKGROUND: Limited research exists evaluating artificial intelligence (AI) performance on standardized pediatric assessments. This study evaluated 3 leading AI models on pediatric board preparation questions.

OBJECTIVE: The aim of this study is to evaluate and compare the performance of 3 leading large language models (LLMs) on pediatric board examination preparation questions and contextualize their performance against human physician benchmarks.

METHODS: We analyzed DeepSeek-R1, ChatGPT-4, and ChatGPT-4.5 using 266 multiple-choice questions from the 2023 PREP Self-Assessment. Performance was compared to published American Board of Pediatrics first-time pass rates.

RESULTS: DeepSeek-R1 exhibited the highest accuracy at 98.1% (261/266 correct responses). ChatGPT-4.5 achieved 96.6% accuracy (257/266), performing at the upper threshold of human performance. ChatGPT-4 demonstrated 82.7% accuracy (220/266), comparable to the lower range of human pass rates. Error pattern analysis revealed that AI models most commonly struggled with questions requiring integration of complex clinical presentations with rare disease knowledge.

CONCLUSIONS: DeepSeek-R1 demonstrated exceptional performance exceeding typical American Board of Pediatrics pass rates, suggesting potential applications in medical education and clinical support, though further research on complex clinical reasoning is needed.
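As a quick arithmetic check, the reported accuracy percentages follow directly from the correct/total counts given in the abstract. A minimal sketch (the model names and counts are taken from the text; the rounding to one decimal place is an assumption about how the authors reported their figures):

```python
# Sanity-check of the accuracy figures reported in the abstract.
# Counts (correct, total) are as stated in the RESULTS section.
results = {
    "DeepSeek-R1": (261, 266),
    "ChatGPT-4.5": (257, 266),
    "ChatGPT-4": (220, 266),
}

for model, (correct, total) in results.items():
    accuracy = 100 * correct / total
    print(f"{model}: {accuracy:.1f}%")  # matches the reported 98.1%, 96.6%, 82.7%
```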
