Abstract
AIM: To evaluate the diagnostic accuracy of ChatGPT-4o (OpenAI, San Francisco, CA, USA), a large language model (LLM), in identifying corneal pathology solely from slit-lamp photographs, without additional clinical context, and to compare its accuracy with that of consultant ophthalmologists.

METHODS: This was a prospective diagnostic accuracy study. A total of 22 images were selected from the Atlas at EyeRounds.org (The University of Iowa). Diagnostic accuracy, defined as the proportion of correctly identified cases, was calculated for ChatGPT-4o and for two consultant ophthalmologists against the reference standard. Pairwise differences between ChatGPT-4o and each consultant ophthalmologist were evaluated using McNemar's test.

RESULTS: The accuracy of identifying a characteristic sign or diagnosis was 0.50 (95% CI: 0.28 - 0.72, p-value 1.00) for ChatGPT-4o, compared with 0.64 (95% CI: 0.41 - 0.83, p-value 0.29) for Ophthalmologist A and 0.55 (95% CI: 0.32 - 0.76, p-value 0.83) for Ophthalmologist B. McNemar's test demonstrated a statistically significant difference between ChatGPT-4o and Ophthalmologist A (p = 0.01), whereas no statistically significant difference was observed between ChatGPT-4o and Ophthalmologist B (p = 0.37).

CONCLUSIONS: In this study, ChatGPT-4o demonstrated moderate diagnostic accuracy in identifying corneal pathology from slit-lamp photographs; its performance was comparable to that of one consultant ophthalmologist but significantly lower than that of the other. These findings highlight the potential feasibility of LLMs as adjunctive tools in ophthalmic image interpretation. Limitations include the model's tendency to produce confident yet occasionally inaccurate responses. While not yet suitable for autonomous diagnostic use, ChatGPT-4o shows promise as a supportive aid in clinical decision-making under appropriate expert supervision.
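For readers wishing to reproduce the statistical approach described above, the following is a minimal Python sketch, not the study's actual analysis code. The per-case correctness vectors are hypothetical placeholders (the abstract does not report case-level data), and the use of an exact Clopper-Pearson interval and an exact McNemar's test are assumptions, since the abstract does not specify which variants were used.

```python
# Sketch of the abstract's statistics on hypothetical data: accuracy with an
# exact binomial 95% CI, and a paired McNemar's test between two raters.
import numpy as np
from scipy.stats import binomtest
from statsmodels.stats.contingency_tables import mcnemar

N_CASES = 22  # number of slit-lamp images in the study

# Hypothetical per-case correctness (1 = matches the reference standard).
rng = np.random.default_rng(0)
chatgpt = rng.integers(0, 2, N_CASES)
ophthalmologist_a = rng.integers(0, 2, N_CASES)

# Accuracy as a proportion, with an exact (Clopper-Pearson) 95% CI.
result = binomtest(int(chatgpt.sum()), N_CASES)
ci = result.proportion_ci(confidence_level=0.95, method="exact")
print(f"Accuracy: {chatgpt.mean():.2f} (95% CI: {ci.low:.2f} - {ci.high:.2f})")

# 2x2 table of paired agreement/disagreement between the two raters;
# McNemar's test uses only the discordant (off-diagonal) cells.
table = np.array([
    [np.sum((chatgpt == 1) & (ophthalmologist_a == 1)),
     np.sum((chatgpt == 1) & (ophthalmologist_a == 0))],
    [np.sum((chatgpt == 0) & (ophthalmologist_a == 1)),
     np.sum((chatgpt == 0) & (ophthalmologist_a == 0))],
])
print(f"McNemar p-value: {mcnemar(table, exact=True).pvalue:.2f}")
```

McNemar's test is the appropriate paired comparison here because ChatGPT-4o and each ophthalmologist graded the same 22 images, so their errors are correlated and an unpaired proportion test would be miscalibrated.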