Application of Large Language Models in Complex Clinical Cases: Cross-Sectional Evaluation Study


Abstract

BACKGROUND: Large language models (LLMs) have made significant advancements in natural language processing (NLP) and are gradually showing potential for application in the medical field. However, LLMs still face challenges in clinical medicine. OBJECTIVE: This study aims to evaluate the efficiency, accuracy, and cost of LLMs in handling complex medical cases and to assess their potential and applicability as tools for clinical decision support. METHODS: We selected cases from the database of the Department of Cardiothoracic Surgery, the Third Affiliated Hospital of Sun Yat-sen University (2021-2024), and conducted a multidimensional preliminary evaluation of the latest LLMs in clinical decision-making for complex cases. The evaluation measured the time each LLM took to generate decision recommendations, expert-assigned Likert scores, and decision costs, in order to assess the models' execution efficiency, accuracy, and cost-effectiveness. RESULTS: A total of 80 complex cases were included in this study, and the performance of multiple LLMs in clinical decision-making was evaluated. Experts required 33.60 minutes on average (95% CI 32.57-34.63), far longer than any LLM. GPTo1 (0.71 min, 95% CI 0.67-0.74), GPT4o (0.88 min, 95% CI 0.83-0.92), and Deepseek-R1 (0.94 min, 95% CI 0.90-0.96) each completed in under a minute, with no statistically significant differences among them. Although Kimi, Gemini, LLaMa3-8B, and LLaMa3-70B took 1.02-3.20 minutes, they were still faster than the experts. In terms of decision accuracy, Deepseek-R1 had the highest accuracy (mean Likert score=4.19), with no significant difference compared to GPTo1 (P=.699), and both performed significantly better than GPT4o, Kimi, Gemini, LLaMa3-70B, and LLaMa3-8B (P<.001).
Deepseek-R1 and GPTo1 demonstrated the lowest hallucination rates, 6/80 (8%) and 5/80 (6%), respectively, significantly outperforming GPT4o (7/80, 9%), Kimi (10/80, 12%), and the Gemini and LLaMa3 models, which exhibited substantially higher rates ranging from 13/80 (16%) to 25/80 (31%). Regarding decision costs, all LLMs incurred significantly lower costs than the multidisciplinary team, with open-source models such as Deepseek-R1 offering a zero-direct-cost advantage. CONCLUSIONS: GPTo1 and Deepseek-R1 show strong clinical potential, boosting efficiency, maintaining accuracy, and reducing costs. GPT4o and Kimi performed moderately, indicating suitability for broader clinical tasks. Further research is needed to validate the LLaMa3 series and Gemini in clinical decision-making.
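The timing results above are reported as a mean with a 95% confidence interval (e.g., 33.60 minutes, 95% CI 32.57-34.63). As a minimal sketch of how such an interval can be computed, the snippet below uses a normal-approximation (z = 1.96) CI for the mean; the sample values are hypothetical placeholders, not data from the study, and the study itself may have used a different procedure (e.g., a t-based interval).

```python
# Illustrative sketch only: mean and normal-approximation 95% CI for a
# set of decision times. The sample values are hypothetical, not study data.
import math
import statistics


def mean_ci_95(samples):
    """Return (mean, lower, upper) using a normal-approximation 95% CI."""
    n = len(samples)
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / math.sqrt(n)  # standard error of the mean
    margin = 1.96 * sem  # z critical value for a two-sided 95% interval
    return mean, mean - margin, mean + margin


# Hypothetical decision times (minutes) for one model
times = [0.9, 0.8, 1.0, 0.95, 0.85]
m, lo, hi = mean_ci_95(times)
print(f"mean={m:.2f} min, 95% CI {lo:.2f}-{hi:.2f}")
```

For the small per-model samples typical of such comparisons, a t-distribution critical value would widen the interval slightly relative to the z approximation shown here.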
