Abstract
OBJECTIVE: To evaluate the performance of generative AI tools, specifically Ernie Bot and ChatGPT, in supporting online medical consultations in China, focusing on their accuracy, safety, and empathy, and to assess their potential role in addressing the supply-demand gap in the healthcare system. METHODS: We collected 233 trigeminal neuralgia consultations from a Chinese medical platform, including patient questions and doctor replies. Each question was input into ChatGPT-3.5 and Ernie Bot with role-specific prompts to generate large language model (LLM) responses. Four blinded raters (two doctors and two patients) evaluated all responses using DISCERN and a modified PEMAT. Lexical, syntactic, and semantic analyses were conducted, with Spearman correlations assessing links between linguistic features and perceived quality. RESULTS: While doctors led in reliability, Ernie Bot scored highest overall, especially in empathy and clarity, likely due to stylistic choices rather than true understanding. Despite their fluency, LLMs remain prone to factual errors. Text analysis showed distinct linguistic patterns, with several features significantly correlated with perceived quality. CONCLUSION: LLMs demonstrate strengths in perceived empathy and clarity but fall short in clinical accuracy and depth when addressing complex cases. Although they outperform doctors in communication-related aspects, their limitations in high-risk decision-making remain evident. As such, LLMs hold promise as adjunct tools for non-urgent consultations, but further refinement is needed to meet the standards of precise and personalized healthcare delivery.