Multimodal large language model versus emergency physicians for burn assessment: a prospective non-inferiority study



Authors: Aykut, Ahmet; Karayıl, Ali Rıza; Yıldırım, Cem; Günsoy, Ertuğ; Tatlı, Mehmet; Avcı, Murat

Journal:	Scandinavian Journal of Trauma Resuscitation & Emergency Medicine	Impact factor:	3.100
Year:	2026	Citation:	2026 Feb 5;34(1)
DOI:	10.1186/s13049-026-01577-6

Abstract

BACKGROUND: Accurate burn size and depth assessment at first contact guides fluid resuscitation, referral, and operative planning, yet both tasks show meaningful inter-clinician variability. General-purpose multimodal large language models may offer scalable, image-based decision support in emergency care, but prospective benchmarking against clinicians and a robust reference standard remains limited.

METHODS: We conducted a prospective, single-centre diagnostic accuracy and agreement study in a tertiary emergency department (22 July-8 September 2025). Consecutive acute burn presentations (<24 h) were screened; protocol-conformant cases contributed standardized three-view photographs per anatomically distinct burn region. A multimodal large language model generated region-level estimates of total body surface area (TBSA) contribution and burn depth class. Eighteen emergency physicians independently rated the same images and minimal metadata, blinded to model and reference outputs. A three-member expert panel served as the reference standard by consensus. The primary endpoint was non-inferiority of the model versus the physician median for region-level absolute TBSA error relative to the panel, with a pre-specified margin of 3 percentage points, using patient-level cluster bootstrap for inference. Secondary endpoints included TBSA agreement and depth agreement (quadratic-weighted kappa).

RESULTS: Of 413 screened presentations, 52 patients were enrolled, yielding 64 analyzable burn region-cases (35 pediatric, 29 adult). The model's mean absolute TBSA error versus the panel was 1.40 percentage points (median 1.00); 87.5% of cases were within ±3 percentage points and 98.4% within ±5. The physician median had a mean absolute error of 0.89 percentage points (median 0.75). The paired non-inferiority analysis met the pre-specified criterion (Hodges-Lehmann median Δ = 0.25; one-sided 95% upper bound = 0.50), indicating the model was non-inferior to physicians for TBSA estimation. In contrast, depth agreement versus the panel was slight for the model (quadratic-weighted kappa 0.14), with systematic underestimation of deeper burns, while physician consensus showed substantially higher agreement (quadratic-weighted kappa 0.65).

CONCLUSIONS: In this prospective emergency department evaluation, a general-purpose multimodal model achieved non-inferior performance to emergency physicians for region-level TBSA estimation but performed substantially worse for burn depth classification. These findings support a narrowly defined adjunct role for TBSA estimation, while depth-dependent decisions should remain clinician-led and require further method development and external validation.
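The statistics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the function names, the percentile-style bootstrap bound, the number of resamples, and the toy numbers are all assumptions made for illustration. The sketch computes a one-sample Hodges-Lehmann estimate of the paired differences (model absolute error minus physician-median absolute error), a one-sided upper bound from a bootstrap that resamples whole patients so that regions from the same patient stay together, and a quadratic-weighted kappa for ordinal depth classes.

```python
import math
import random
from collections import defaultdict
from statistics import median


def hodges_lehmann(diffs):
    """One-sample Hodges-Lehmann estimate: the median of all
    Walsh averages (d_i + d_j) / 2 with i <= j."""
    n = len(diffs)
    walsh = [(diffs[i] + diffs[j]) / 2 for i in range(n) for j in range(i, n)]
    return median(walsh)


def cluster_bootstrap_upper(patient_ids, diffs, n_boot=2000, level=0.95, seed=0):
    """One-sided upper percentile bound for the Hodges-Lehmann estimate.
    Whole patients (clusters) are resampled with replacement.
    Assumption: a plain percentile bootstrap; the study may have used a variant."""
    by_patient = defaultdict(list)
    for pid, d in zip(patient_ids, diffs):
        by_patient[pid].append(d)
    clusters = list(by_patient.values())
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        sample = []
        for _ in range(len(clusters)):
            sample.extend(rng.choice(clusters))
        estimates.append(hodges_lehmann(sample))
    estimates.sort()
    idx = min(n_boot - 1, max(0, math.ceil(level * n_boot) - 1))
    return estimates[idx]


def quadratic_weighted_kappa(rater_a, rater_b, n_classes):
    """Cohen's kappa with quadratic disagreement weights for ordinal
    classes coded 0 .. n_classes - 1."""
    n = len(rater_a)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    row = [sum(observed[i]) for i in range(n_classes)]
    col = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            num += w * observed[i][j]
            den += w * row[i] * col[j] / n  # expected count under independence
    return 1.0 - num / den


if __name__ == "__main__":
    # Hypothetical toy data, NOT study data: per-region paired differences
    # in absolute TBSA error (model minus physician median), in percentage points.
    ids = ["p1", "p1", "p2", "p3", "p3", "p4", "p5", "p6"]
    diffs = [0.5, 0.0, 0.25, 1.0, -0.5, 0.25, 0.5, 0.0]
    margin = 3.0  # pre-specified non-inferiority margin from the abstract
    upper = cluster_bootstrap_upper(ids, diffs)
    print("HL estimate:", hodges_lehmann(diffs))
    print("upper bound:", upper, "non-inferior:", upper < margin)
```

Under this reading, non-inferiority is concluded when the one-sided upper bound falls below the margin, which matches the reported result (upper bound 0.50 against a margin of 3 percentage points).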
