Abstract
BACKGROUND: Viral hepatitis is a major global public health problem that affects millions of people, so accurate and accessible information is essential for the general public and non-specialist healthcare providers to understand, prevent, and manage the disease. This study evaluated four large language models (LLMs), namely Gemini-2.0, Claude-3.5-sonnet, ChatGPT-4.5, and ChatGPT-4, and compared their responses to viral hepatitis-related questions to assess differences in performance across models. METHODS: This comparative evaluation study, conducted at Nanjing Drum Tower Hospital from March to April 2025, assessed the four models' responses to 52 questions on viral hepatitis spanning four domains: concepts, risk factors, diagnosis, and prevention and treatment. Responses were first rated on a three-point scale of good, borderline, or poor. They were then evaluated for relevance, comprehensiveness, accuracy, safety, and readability, each scored on a scale of 1 to 5. RESULTS: ChatGPT-4.5 achieved the highest performance, with 89.1% of its responses rated as good, significantly outperforming Claude-3.5-sonnet (71.15% good), Gemini-2.0 (62.82% good), and ChatGPT-4 (50.64% good). Statistical analysis confirmed the superior performance of ChatGPT-4.5 in all evaluated dimensions, and it scored highest on each of the five criteria: relevance, comprehensiveness, accuracy, safety, and readability. CONCLUSIONS: ChatGPT-4.5 demonstrated superior performance in addressing viral hepatitis queries compared with the other three models. Its high reliability makes it a valuable tool for improving information accessibility for patients and for medical professionals who do not specialize in viral hepatitis.