Comparative evaluation of multimodal large language models for diagnostic accuracy in pediatric electrocardiography: a prospective comparative diagnostic accuracy study



Abstract

We evaluated three multimodal LLMs, ChatGPT (GPT-5.2), Gemini 3, and Microsoft Copilot, in pediatric ECG interpretation, focusing on clinically significant abnormalities and emergency arrhythmias, with likelihood ratios as the primary outcome measures. This prospective comparative diagnostic accuracy study (STARD/STARD-AI) included 264 pediatric patients with 12-lead ECGs (November 2024-November 2025). De-identified images were submitted via a standardized zero-shot prompt. Three blinded pediatric cardiologists established the reference diagnosis by majority-vote consensus. Cases were classified as Tier 1 (normal), Tier 2 (abnormal, non-urgent), or Tier 3 (urgent). Two binary endpoints were assessed: clinically significant abnormality (Tier 2 + 3 vs Tier 1) and emergency abnormality (Tier 3 vs Tier 1 + 2).

Clinically significant abnormalities were present in 54.5% of patients. AUC values ranged from 0.550 to 0.623, reflecting modest discrimination. For the clinically significant endpoint, +LR values were 2.05 (ChatGPT), 1.26 (Gemini), and 1.21 (Copilot); -LR values were 0.68, 0.55, and 0.81, respectively, indicating limited rule-in and insufficient rule-out utility. For the emergency endpoint, Gemini achieved 100% sensitivity (95% CI = 85.1-100.0) with a -LR of 0.07 (95% CI = 0.00-1.12) in a small subgroup (n = 22); however, a specificity of 30.2% and a +LR of 1.40 indicate overcalling rather than diagnostic precision. No model achieved clinically meaningful rule-in utility for either endpoint.

CONCLUSIONS: Current multimodal LLMs showed limited diagnostic utility in pediatric ECG interpretation, with +LR values near 1.0 across both endpoints. Standalone deployment is not supported; these tools may at most serve as adjunctive screening aids under clinician oversight.

WHAT IS KNOWN:
• Deep learning algorithms trained on large ECG datasets perform well in adult populations, but evidence in pediatric ECG interpretation is limited.
• General-purpose LLMs show variable accuracy in medical examinations; reliability in subspecialty domains such as pediatric cardiology remains unproven.

WHAT IS NEW:
• This is the first head-to-head comparative diagnostic accuracy study of multimodal LLMs in pediatric ECG evaluation, using likelihood ratios as primary outcome measures.
• All three LLMs showed limited rule-in utility (+LR near 1.0); Gemini achieved potentially meaningful rule-out performance for emergency arrhythmias (-LR = 0.07), but with wide confidence intervals reflecting the small emergency subgroup (n = 22).
• Gemini's 100% sensitivity in the emergency subgroup reflects overcalling (specificity 30.2%), consistent with a triage/screening behavior rather than diagnostic precision.
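To make the reported likelihood ratios more concrete, the following is a minimal sketch of how an LR shifts a pre-test probability to a post-test probability via the standard odds form of Bayes' theorem. It is an illustration, not the authors' analysis code. The function name `post_test_probability` is hypothetical; the inputs are the prevalence figures (54.5%, and 22 of 264 for the emergency subgroup) and LR point estimates quoted in the abstract, and the wide confidence intervals around those estimates are ignored here.

```python
def post_test_probability(pretest_probability: float, likelihood_ratio: float) -> float:
    """Convert a pre-test probability to a post-test probability using an LR.

    post-test odds = pre-test odds * LR, then odds are converted back to a probability.
    """
    if not 0.0 < pretest_probability < 1.0:
        raise ValueError("pretest_probability must be strictly between 0 and 1")
    if likelihood_ratio < 0.0:
        raise ValueError("likelihood_ratio must be non-negative")
    pretest_odds = pretest_probability / (1.0 - pretest_probability)
    post_odds = pretest_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


# Clinically significant endpoint: prevalence 54.5%, ChatGPT +LR 2.05 and -LR 0.68.
significant_prevalence = 0.545
after_positive = post_test_probability(significant_prevalence, 2.05)
after_negative = post_test_probability(significant_prevalence, 0.68)

# Emergency endpoint: 22 of 264 patients, Gemini +LR 1.40 and -LR 0.07.
emergency_prevalence = 22 / 264
emergency_after_positive = post_test_probability(emergency_prevalence, 1.40)
emergency_after_negative = post_test_probability(emergency_prevalence, 0.07)

print(f"Significant, positive call: {after_positive:.3f}")
print(f"Significant, negative call: {after_negative:.3f}")
print(f"Emergency, positive call:   {emergency_after_positive:.3f}")
print(f"Emergency, negative call:   {emergency_after_negative:.3f}")
```

Under these assumptions, a positive ChatGPT call raises the probability of a clinically significant abnormality from 54.5% to roughly 71%, and a negative call lowers it only to roughly 45%, which is one way to read "limited rule-in and insufficient rule-out utility". For the emergency endpoint, a negative Gemini call lowers the probability from about 8% to under 1% at the point estimate, while a positive call raises it only to about 11%.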
