Abstract
BACKGROUND: Generative artificial intelligence (AI) chatbots are increasingly used by patients, yet their reliability for complex ophthalmic conditions remains uncertain. This study compared the accuracy, comprehensiveness, and reproducibility of five AI chatbots (ChatGPT-5.0, DeepSeek R1, Meta AI, Grok 3.0, and Google Gemini 2.5 Pro) in responding to patient-centered vitreoretinal questions.

METHODS: A total of 135 questions covering diabetic retinopathy, floaters/flashes, age-related macular degeneration, retinal tear/detachment, and vitrectomy were sourced from the American Academy of Ophthalmology "Ask an Ophthalmologist" database. Each question was submitted twice to each chatbot under standardized instructions. Two board-certified vitreoretinal ophthalmologists independently graded responses for accuracy and reproducibility. Accuracy was calculated as the proportion of responses graded "Correct and comprehensive" or "Accurate but incomplete"; reproducibility was defined as agreement between the two responses to the same question.

RESULTS: ChatGPT-5.0 achieved the highest overall accuracy (94%, n=127/135, 95% CI: 89.9%-98.1%) with a reproducibility rate of 96.3% (n=130/135, 95% CI: 93.1%-99.5%). DeepSeek R1 demonstrated the greatest reproducibility (98.5%, n=133/135, 95% CI: 96.5%-100.0%) and high accuracy (92.6%, n=125/135, 95% CI: 88.1%-97.1%). Meta AI showed 91% accuracy (95% CI: 86.1%-95.9%) and 94% reproducibility (95% CI: 89.9%-98.1%), whereas Grok 3.0 yielded the lowest accuracy (49.6%, n=67/135, 95% CI: 41.2%-58.0%) despite moderate reproducibility (88.1%, n=119/135, 95% CI: 82.7%-93.5%). Google Gemini 2.5 Pro recorded 72.6% accuracy (95% CI: 65.1%-80.1%) and the lowest reproducibility (77%, 95% CI: 69.9%-84.1%). By category, "Vitrectomy" scored highest across all chatbots (94%, 95% CI: 87.2%-100.0%), followed by "Macular degeneration" (90%, 95% CI: 85.0%-95.0%).
In contrast, the "Diabetic retinopathy" category had the lowest accuracy (64.7%, 95% CI: 52.1%-77.3%).

CONCLUSION: ChatGPT-5.0 and DeepSeek R1 achieved high accuracy and reproducibility approaching clinical standards, indicating potential as patient-education tools in vitreoretinal care. However, variability across models and disease categories highlights the need for cautious clinical adoption and continued optimization to ensure safe, reliable information delivery.
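The abstract does not state which confidence-interval method was used. As a minimal sketch, assuming a normal-approximation (Wald) interval clipped to [0, 1], the reported bounds can be reproduced to within rounding (e.g., DeepSeek R1's 96.5%-100.0% exactly; ChatGPT's lower bound differs by ~0.2 points, consistent with the authors rounding 127/135 to 94% before computing):

```python
import math

def proportion_ci(successes: int, total: int, z: float = 1.96):
    """Point estimate and normal-approximation (Wald) 95% CI for a
    proportion, clipped to the valid range [0, 1]."""
    p = successes / total
    se = math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - z * se), min(1.0, p + z * se)

# ChatGPT accuracy: 127/135 correct responses
p, lo, hi = proportion_ci(127, 135)
print(f"{p:.1%} (95% CI: {lo:.1%}-{hi:.1%})")  # 94.1% (95% CI: 90.1%-98.1%)

# DeepSeek R1 reproducibility: 133/135; upper bound clips at 100.0%
p2, lo2, hi2 = proportion_ci(133, 135)
print(f"{p2:.1%} (95% CI: {lo2:.1%}-{hi2:.1%})")  # 98.5% (95% CI: 96.5%-100.0%)
```

This is illustrative only; the study may have used a Wilson or exact (Clopper-Pearson) interval, which would shift the bounds slightly, especially for proportions near 100%.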