Abstract
OBJECTIVES: Pediatric cataract occurs during the critical period of visual development, and early intervention is essential to avoid irreversible visual impairment. The health literacy and self-management ability of children and their parents directly affect treatment adherence and prognosis. With the rapid development of artificial intelligence, this study aims to evaluate the accuracy, completeness, and reproducibility of domestic open-source large language models (LLMs) in answering common clinical questions from pediatric cataract patients, and to explore their potential as an online health information resource for these patients. METHODS: The research team collected real patient questions posted on mainstream online medical platforms since 2016 and categorized them into 5 domains: risk factors, disease diagnosis, symptoms and staging, screening and examinations, and treatment and prognosis. After expert review, 40 high-attention questions were finalized, and manual reference answers were provided by experts. Four domestic open-source LLMs (Kimi chat, Doubao, ERNIE Bot 3.5, and DeepSeek) were selected. Each question was asked 4 times, including 2 times with a "patient-physician" role prompt. Three cataract specialists with the title of associate chief physician or above blindly scored the answers using a 4-level accuracy scale, a 3-level completeness scale, and a 3-level reproducibility scale. The evaluation followed a two-stage scheme: Stage 1 preliminarily tested the 4 LLMs using 6 questions of recognized lower difficulty; Stage 2 performed a full evaluation of all 40 questions on the highest-scoring LLM from Stage 1. RESULTS: In Stage 1, regardless of whether role prompts were included, Kimi chat performed best among the 4 LLMs, followed by Doubao and ERNIE Bot 3.5, with DeepSeek ranking last. 
The proportion of Kimi chat answers scoring accuracy=4, completeness=3, and reproducibility=3 was higher than that of Doubao, ERNIE Bot 3.5, and DeepSeek. In Stage 2, Kimi chat completed all 40 questions. Its median answer length was 531 (277, 1 059) words, significantly longer than the 369 (162, 707) words of the manual reference answers (Z=-4.096, P<0.001). However, answer length showed no significant correlation with accuracy or completeness (both P>0.05). Across the 240 model responses, 83.8% scored accuracy≥3, 77.9% scored completeness=3, and 66.7% achieved reproducibility≥70%. In 62.1% (149/240) of evaluations, the Kimi chat answer was selected as the top preference; reasons for non-selection included off-topic responses, controversial suggestions, and redundant information. CONCLUSIONS: Domestic open-source LLMs, especially Kimi chat, demonstrated relatively good performance in pediatric cataract health education scenarios, providing parents with medical information of good accuracy, completeness, and reproducibility. LLMs hold great potential in healthcare, but information security, hallucination, and bias remain key challenges, and they still cannot replace clinical physicians. In the future, LLMs are expected to collaborate with physicians to deliver more efficient, personalized medical services and promote the development of healthcare.