Automated Tumor and Node Staging from Esophageal Cancer Endoscopic Ultrasound Reports: A Benchmark of Advanced Reasoning Models with Prompt Engineering and Cross-Lingual Evaluation

基于食管癌内镜超声报告的肿瘤和淋巴结自动分期：高级推理模型的基准测试、快速工程和跨语言评估

阅读：2

作者：Hu,Xudong,Feng,Lingde,Jing,Bingzhong,Luo,Linna,Tan,Wencheng,Li,Yin,Zheng,Xinyi,Huang,Xinxin,Lin,Shiyong,Wu,Huiling,He,Longjun

期刊：	Diagnostics	影响因子：	3.300
时间：	2026	起止号：	2026 Jan 9;16(2)
doi：	10.3390/diagnostics16020215	研究方向：	肿瘤
疾病类型：	食管癌

Abstract

Objectives: To benchmark the performance of DeepSeek-R1 against three other advanced AI reasoning models (GPT-4o, Qwen3, Grok-3) in automatically extracting T/N staging from esophageal cancer endoscopic ultrasound (EUS) complex medical reports, and to evaluate the impact of language (Chinese/English) and prompting strategy (with/without designed prompt) on model accuracy and robustness. Methods: We retrospectively analyzed 625 EUS reports for T-staging and 579 for N-staging, which were collected from 663 patients at the Sun Yat-sen University Cancer Center between 2018 and 2020. A 2 × 2 factorial design (Language × Prompt) was employed under a zero-shot setting. The performance of the models was evaluated using accuracy, and the odds ratio (OR) was calculated to quantify the comparative performance advantage between models across different scenarios. Results: Performance was evaluated across four scenarios: (1) Chinese with-prompt, (2) Chinese without-prompt, (3) English with-prompt, and (4) English without-prompt. In both T and N-staging tasks, DeepSeek-R1 demonstrated superior overall performance compared to the competitors. For T-staging, the average accuracy was (DeepSeek-R1 vs. GPT-4o vs. Qwen3 vs. Grok-3: 91.4% vs. 84.2% vs. 89.5% vs. 81.3%). For N-staging, the respective average accuracy was 84.2% vs. 65.0% vs. 68.4% vs. 51.9%. Notably, N-staging proved more challenging than T-staging for all models, as indicated by lower accuracy. This superiority was most pronounced in the Chinese without-prompt T-staging scenario, where DeepSeek-R1 achieved significantly higher accuracy than GPT-4o (OR = 7.84, 95% CI [4.62-13.30], p < 0.001), Qwen3 (OR = 5.00, 95% CI [2.85-8.79], p < 0.001), and Grok-3 (OR = 6.47, 95% CI [4.30-9.74], p < 0.001). Conclusions: This study validates the feasibility and effectiveness of large language models (LLMs) for automated T/N staging from EUS reports. Our findings confirm that DeepSeek-R1 possesses strong intrinsic reasoning capabilities, achieving the most robust performance across diverse conditions, with the most pronounced advantage observed in the challenging English without-prompt N-staging task. By establishing a standardized, objective benchmark, DeepSeek-R1 mitigates inter-observer variability, and its deployment provides a reliable foundation for guiding precise, individualized treatment planning for esophageal cancer patients.

特别声明

1、本页面内容包含部分的内容是基于公开信息的合理引用；引用内容仅为补充信息，不代表本站立场。

2、若认为本页面引用内容涉及侵权，请及时与本站联系，我们将第一时间处理。

3、其他媒体/个人如需使用本页面原创内容，需注明“来源：[生知库]”并获得授权；使用引用内容的，需自行联系原作者获得许可。

4、投稿及合作请联系：info@biocloudy.com。