Abstract
IMPORTANCE: There is a growing body of literature evaluating the clinical knowledge and reasoning performance of large language models (LLMs) in ophthalmology, but to date, investigations into their clinical multimodal abilities, such as interpreting images and tables, have been limited.

OBJECTIVE: To evaluate the multimodal performance of 7 foundation models (FMs): GPT-4o (OpenAI), Gemini 1.5 Pro (Google), Claude 3.5 Sonnet (Anthropic), Llama-3.2-11B (Meta), DeepSeek V3 (High-Flyer), Qwen2.5-Max (Alibaba Cloud), and Qwen2.5-VL-72B (Alibaba Cloud), in answering textual and multimodal multiple-choice questions from the offline Fellowship of the Royal College of Ophthalmologists Part 2 written examination, with head-to-head comparisons against physicians.

DESIGN, SETTING, AND PARTICIPANTS: This cross-sectional study was conducted between September 2024 and March 2025 using questions sourced from a textbook used as a preparation resource for the Fellowship of the Royal College of Ophthalmologists Part 2 written examination.

EXPOSURE: FM performance.

MAIN OUTCOMES AND MEASURES: The primary outcome measure was FM accuracy, defined as the proportion of model-generated answers matching the textbook's labeled letter answer.

RESULTS: For textual questions, Claude 3.5 Sonnet (accuracy, 77.7%) outperformed all other FMs (GPT-4o [accuracy, 69.9%], Qwen2.5-Max [accuracy, 69.3%], DeepSeek V3 [accuracy, 63.2%], Gemini 1.5 Pro [accuracy, 62.6%], Qwen2.5-VL-72B [accuracy, 58.3%], and Llama-3.2-11B [accuracy, 50.7%]), as well as ophthalmology trainees (difference, 9.0%; 95% CI, 2.4%-15.6%; P = .01) and junior physicians (difference, 35.2%; 95% CI, 28.3%-41.9%; P < .001), and performed comparably with expert ophthalmologists (difference, 1.3%; 95% CI, -5.1% to 7.4%; P = .72). GPT-4o (accuracy, 69.9%) outperformed GPT-4 (OpenAI; difference, 8.5%; 95% CI, 1.1%-15.8%; P = .02) and GPT-3.5 (OpenAI; difference, 21.8%; 95% CI, 14.3%-29.2%; P < .001). For multimodal questions, GPT-4o (accuracy, 57.5%) outperformed all other FMs (Claude 3.5 Sonnet [accuracy, 47.5%], Qwen2.5-VL-72B [accuracy, 45.0%], Gemini 1.5 Pro [accuracy, 35.0%], and Llama-3.2-11B [accuracy, 25.0%]) and the junior physician (difference, 15.0%; 95% CI, -6.7% to 36.7%; P = .18) but was weaker than expert ophthalmologists (accuracy range, 70.0%-85.0%; P = .16) and trainees (accuracy range, 62.5%-80.0%; P = .35).

CONCLUSIONS AND RELEVANCE: Results of this cross-sectional study suggest that for textual questions, current FMs exhibit notable improvements in ophthalmological knowledge and reasoning compared with older LLMs and with ophthalmology trainees, achieving performance comparable with that of expert ophthalmologists. These models show potential as medical assistants for answering textual ophthalmological queries, but their multimodal abilities remain limited. Further research, or fine-tuning of models with diverse ophthalmic multimodal data, may lead to more capable applications with multimodal functionalities.