Abstract
Introduction
Pneumonia remains a significant cause of morbidity and mortality in children globally. Chest radiographs (CXRs) are widely used to diagnose pediatric pneumonia; however, distinguishing between bacterial and viral etiologies on imaging is diagnostically challenging. Large language models (LLMs), particularly those with integrated vision capabilities, have shown promise in preliminary studies for interpreting CXR findings. However, the diagnostic performance of general-purpose LLMs without specialized medical training or add-ons remains poorly understood. This study examined whether such LLMs could independently and reliably distinguish between bacterial pneumonia, viral pneumonia, and normal findings on pediatric CXRs.

Methods
We evaluated four publicly available LLMs (ChatGPT o3, Claude 3.7 Sonnet, Gemini 2.5 Pro, and Grok 3) on a dataset of 44 pediatric CXRs confirmed by human readers to show bacterial pneumonia (n = 17), viral pneumonia (n = 13), or no abnormality (n = 14). Each image was analyzed twice by each LLM using a standardized prompt, yielding a total of eight readings per image. Diagnostic accuracy was assessed relative to human expert consensus, and internal consistency was measured by comparing repeated interpretations. A prespecified adaptive stopping rule based on performance futility criteria was applied. Sample size calculations and statistical analyses were conducted using G*Power.

Results
Across all models and CXR types, the average diagnostic accuracy was 31%, consistent with chance-level performance on a three-choice classification task. Accuracy was highest for viral pneumonia (54%) and lowest for normal CXRs (18%). Internal consistency ranged from 46% to 71% across models, indicating unreliable performance. Concordance with human expert interpretation did not exceed 49% for any model. Futility criteria were met after 44 cases, prompting early termination of data collection.
Conclusion
General-purpose LLMs currently available to the public are not reliable diagnostic tools for pediatric pneumonia on chest radiographs. Their accuracy is low, particularly in ruling out disease, and their responses lack internal consistency. These findings highlight the risks of deploying such models in unsupervised clinical or consumer-facing settings. Future research should focus on purpose-built radiologic AI tools trained on diverse, clinically representative datasets and integrated with clinician oversight to ensure their safe and effective use.