Epistemic and ethical limits of large language models in evidence-based medicine: from knowledge to judgment


Abstract

BACKGROUND: The rapid evolution of general-purpose large language models (LLMs) offers a promising framework for integrating artificial intelligence into medical practice. While these models can generate medically relevant language, applying them to evidence inference in clinical scenarios poses potential challenges. This study uses empirical experiments to analyze the capability boundaries of current general-purpose LLMs on evidence-based medicine (EBM) tasks, and offers a philosophical reflection on their limitations.

METHODS: This study evaluates the performance of three general-purpose LLMs (ChatGPT, DeepSeek, and Gemini) when applied directly to core EBM tasks. The models were tested in a baseline, unassisted setting, without task-specific fine-tuning, external evidence retrieval, or embedded prompting frameworks. Two clinical scenarios, SGLT2 inhibitors for heart failure and PD-1/PD-L1 inhibitors for advanced NSCLC, were used to assess performance in evidence generation, evidence synthesis, and clinical judgment. Model outputs were evaluated using a multidimensional rubric, and the empirical results were analyzed from an epistemological perspective.

RESULTS: The experiments show that the evaluated general-purpose LLMs can produce syntactically coherent and medically plausible outputs on core evidence-related tasks. However, under current architectures and baseline deployment conditions, several limitations remain: imperfect accuracy in numerical extraction and processing, limited verifiability of cited sources, inconsistent methodological rigor in synthesis, and weak attribution of clinical responsibility in recommendations. Building on these empirical patterns, the philosophical analysis identifies three potential risks in this testing setting: disembodiment, deinstitutionalization, and depragmatization.
CONCLUSIONS: This study suggests that directly applying general-purpose LLMs to clinical evidence tasks entails substantive limitations. Under current architectures, these systems lack embodied engagement with clinical phenomena, do not participate in institutional evaluative norms, and cannot assume responsibility for their reasoning. These findings point toward directions for future medical AI: grounding outputs in real-world data, integrating deployment into clinical workflows with human oversight, and designing human-AI collaboration with clear attribution of responsibility.
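The multidimensional rubric evaluation described in the methods could, in outline, be implemented as follows. This is a minimal sketch: the dimension names, the 1-5 rating scale, the equal-weight aggregation, and the example ratings are all illustrative assumptions, not the study's actual rubric.

```python
from statistics import mean

# Illustrative rubric dimensions, loosely mirroring the limitations the
# abstract reports; the study's actual criteria are not specified here.
DIMENSIONS = [
    "numerical_accuracy",      # correctness of extracted/processed numbers
    "source_verifiability",    # whether cited evidence can be traced
    "methodological_rigor",    # soundness of the synthesis approach
    "responsibility_clarity",  # explicit attribution of clinical responsibility
]

def score_output(ratings: dict[str, int]) -> float:
    """Aggregate per-dimension ratings (assumed 1-5 scale) into one
    unweighted mean score for a single model output."""
    for dim in DIMENSIONS:
        r = ratings[dim]  # KeyError if a rubric dimension was not rated
        if not 1 <= r <= 5:
            raise ValueError(f"rating for {dim} out of range: {r}")
    return mean(ratings[d] for d in DIMENSIONS)

# Hypothetical ratings for one model output on one scenario/task pair.
example = {
    "numerical_accuracy": 3,
    "source_verifiability": 2,
    "methodological_rigor": 4,
    "responsibility_clarity": 2,
}
print(score_output(example))  # 2.75
```

In practice such scores would be collected per model, per scenario, and per task (evidence generation, synthesis, judgment), then compared across the three systems; the unweighted mean here is only one of many defensible aggregation choices.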
