Multimodal large language model versus emergency physicians for burn assessment: a prospective non-inferiority study

Abstract

BACKGROUND: Accurate burn size and depth assessment at first contact guides fluid resuscitation, referral, and operative planning, yet both tasks show meaningful inter-clinician variability. General-purpose multimodal large language models may offer scalable, image-based decision support in emergency care, but prospective benchmarking against clinicians and a robust reference standard remains limited.

METHODS: We conducted a prospective, single-centre diagnostic accuracy and agreement study in a tertiary emergency department (22 July to 8 September 2025). Consecutive acute burn presentations (<24 h) were screened; protocol-conformant cases contributed standardized three-view photographs per anatomically distinct burn region. A multimodal large language model generated region-level estimates of total body surface area (TBSA) contribution and burn depth class. Eighteen emergency physicians independently rated the same images and minimal metadata, blinded to model and reference outputs. A three-member expert panel served as the reference standard by consensus. The primary endpoint was non-inferiority of the model versus the physician median for region-level absolute TBSA error relative to the panel, with a pre-specified margin of 3 percentage points, using patient-level cluster bootstrap for inference. Secondary endpoints included TBSA agreement and depth agreement (quadratic-weighted kappa).

RESULTS: Of 413 screened presentations, 52 patients were enrolled, yielding 64 analyzable burn region-cases (35 pediatric, 29 adult). The model's mean absolute TBSA error versus the panel was 1.40 percentage points (median 1.00); 87.5% of cases were within ±3 percentage points and 98.4% within ±5. The physician median had a mean absolute error of 0.89 percentage points (median 0.75). The paired non-inferiority analysis met the pre-specified criterion (Hodges-Lehmann median Δ = 0.25; one-sided 95% upper bound = 0.50), indicating the model was non-inferior to physicians for TBSA estimation. In contrast, depth agreement versus the panel was slight for the model (quadratic-weighted kappa 0.14), with systematic underestimation of deeper burns, while physician consensus showed substantially higher agreement (quadratic-weighted kappa 0.65).

CONCLUSIONS: In this prospective emergency department evaluation, a general-purpose multimodal model achieved non-inferior performance to emergency physicians for region-level TBSA estimation but performed substantially worse for burn depth classification. These findings support a narrowly defined adjunct role for TBSA estimation, while depth-dependent decisions should remain clinician-led and require further method development and external validation.
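
The statistics named in the abstract can be illustrated with a minimal sketch. This is not the authors' analysis code: the abstract does not specify the bootstrap variant, the number of resamples, or the depth class coding, so the percentile bootstrap, the function names, and the toy data below are all assumptions made for illustration. Each paired difference is taken to be (model absolute TBSA error) minus (physician-median absolute TBSA error) for one burn region, and whole patients are resampled so that regions from the same patient stay together.

```python
import random
from statistics import median


def hodges_lehmann(diffs):
    """One-sample Hodges-Lehmann estimate: the median of all Walsh averages."""
    n = len(diffs)
    walsh = [(diffs[i] + diffs[j]) / 2 for i in range(n) for j in range(i, n)]
    return median(walsh)


def cluster_bootstrap_upper_bound(diffs_by_patient, n_boot=2000, alpha=0.05, seed=0):
    """One-sided (1 - alpha) percentile upper bound for the Hodges-Lehmann
    estimate, resampling patients (clusters) with replacement.

    ASSUMPTION: a plain percentile bootstrap; the abstract only says
    "patient-level cluster bootstrap".
    """
    rng = random.Random(seed)
    patients = list(diffs_by_patient)
    estimates = []
    for _ in range(n_boot):
        resampled = rng.choices(patients, k=len(patients))
        pooled = [d for p in resampled for d in diffs_by_patient[p]]
        estimates.append(hodges_lehmann(pooled))
    estimates.sort()
    idx = min(len(estimates) - 1, int((1 - alpha) * len(estimates)))
    return estimates[idx]


def quadratic_weighted_kappa(rater_a, rater_b, n_classes):
    """Cohen's kappa with quadratic disagreement weights.

    Ratings are integer class indices 0..n_classes-1 (ASSUMPTION: ordinal
    depth classes coded this way). Undefined if the expected weighted
    disagreement is zero, e.g. both raters use a single identical class.
    """
    n = len(rater_a)
    observed = [[0.0] * n_classes for _ in range(n_classes)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            num += w * observed[i][j]          # observed weighted disagreement
            den += w * row[i] * col[j] / n     # chance-expected disagreement
    return 1.0 - num / den


if __name__ == "__main__":
    MARGIN = 3.0  # pre-specified non-inferiority margin from the abstract
    # Made-up paired differences in percentage points, keyed by patient.
    toy = {"p1": [0.5, 0.0], "p2": [0.25], "p3": [1.0], "p4": [-0.25]}
    upper = cluster_bootstrap_upper_bound(toy)
    print("upper bound:", upper, "non-inferior:", upper < MARGIN)
```

Under this reading, non-inferiority is declared when the one-sided upper bound falls below the margin, which matches the reported result (upper bound 0.50 against a margin of 3 percentage points).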
