MEDAI-LLM-SUMM: a reporting checklist for medical text summarization studies using large language models



Abstract

BACKGROUND: Medical text summarization using large language models (LLMs) reached an inflection point in 2024-2025, with adapted models matching or exceeding human expert performance on specific tasks. However, critical gaps persist in safety validation, evaluation frameworks, and clinical deployment readiness. A comprehensive review found that only 7% of studies conducted external validation and only 3% performed patient safety assessments, with reported hallucination rates ranging from 1.47% to 61.6%. Existing reporting guidelines, including CONSORT-AI, SPIRIT-AI, TRIPOD-LLM, and DEAL, do not adequately address the specific requirements of medical text summarization tasks.

OBJECTIVE: To develop MEDAI-LLM-SUMM, the first specialized reporting checklist for research on medical text summarization using LLMs, addressing critical gaps in existing reporting standards.

METHODS: A modified iterative consensus approach was employed, comprising three sequential stages: (1) a systematic literature review of 216 publications from PubMed and eLibrary (2023-2025), conducted following PRISMA guidelines, together with an analysis of existing reporting standards (TRIPOD-LLM, DEAL, CONSORT-AI, SPIRIT-AI, TRIPOD+AI, CLAIM, STARD-AI); (2) development of an initial 44-item, 7-section checklist by a supervisory group; and (3) three rounds of face-to-face consensus discussions with a multidisciplinary expert panel of 11 specialists (3 radiologists, 2 clinicians, 3 medical informatics experts, 1 biostatistician, and 2 medical LLM developers). The consensus criterion required unanimous agreement from all panel members.
RESULTS: The final MEDAI-LLM-SUMM checklist comprises 24 items organized into six sections: (A) Clinical Validity (4 items addressing clinical task definition, expert involvement, hypothesis formulation, and medical expertise requirements); (B) Model Selection (5 items covering model justification, system requirements, deployment environment, the LLM-as-judge approach, and prompt documentation); (C) Data (3 items on datasets, reference summaries with expert consensus, and data stratification); (D) Quality Assessment (8 items including evaluation metrics, clinical metrics, expert evaluation, hallucination detection, LLM-judge assessment, sample size justification, pilot testing, and limitations documentation); (E) Safety (2 items on ethical approval and data anonymization); and (F) Data Availability (2 items on code and dataset accessibility). A comparative analysis against six existing reporting standards demonstrated that MEDAI-LLM-SUMM uniquely addresses hallucination assessment requirements, reference summary creation methodology, LLM-as-judge validation protocols, and detailed pilot testing specifications.
