Abstract
The integration of large language models (LLMs) into surgical decision-making is an emerging field with potential clinical value. This study assessed the preoperative decision-making consistency of ChatGPT-4o, Gemini Advanced, and DeepSeek R1 in comparison with expert consensus, using clinical data from 123 patients undergoing thyroid surgery. Overall concordance rates were 47.97% for ChatGPT-4o, 24.39% for Gemini Advanced, and 56.10% for DeepSeek R1. In decisions on the extent of thyroidectomy, all three models showed moderate consistency with the surgical team, with agreement rates of 61.79% (κ=0.484) for ChatGPT-4o, 67.48% (κ=0.548) for Gemini, and 67.48% (κ=0.535) for DeepSeek R1 (all p < 0.001). However, significant divergence was observed in lymph node dissection planning: ChatGPT-4o achieved a high concordance rate of 69.11% (κ=0.616); DeepSeek R1 showed the highest, at 79.67% (κ=0.741); and Gemini's performance was relatively poor, at 34.96% (κ=0.188). Our findings demonstrate that ChatGPT-4o and DeepSeek R1 exhibit substantial agreement with experienced surgeons in preoperative planning, although overall performance still leaves room for improvement. Model-specific variability, particularly in oncologic decision-making, highlights the need for refinement and robust clinical validation before widespread adoption.
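The agreement statistics above use Cohen's kappa, κ = (p_o − p_e)/(1 − p_e), which corrects the observed agreement p_o for the agreement p_e expected by chance from each rater's label frequencies. A minimal stdlib-only sketch of the computation; the rating lists below are hypothetical illustrations, not the study's data:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater_a)
    # Observed agreement p_o: fraction of cases where the raters concur.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement p_e from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical binary surgical-extent calls (model vs. surgical team):
model = ["total", "total", "hemi", "total", "hemi", "hemi"]
team  = ["total", "hemi",  "hemi", "total", "hemi", "total"]
print(round(cohens_kappa(model, team), 3))  # → 0.333
```

Note that raw concordance and kappa can diverge: with highly imbalanced decisions, chance agreement p_e is large, so a high percent agreement may still yield only a modest κ, which is why the abstract reports both.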