Abstract
PURPOSE: To develop a bilingual multimodal visual question answering (VQA) benchmark for evaluating vision-language models (VLMs) in ophthalmology.
METHODS: In this cross-sectional study, ophthalmic image posts and associated captions published between January 1, 2016, and December 31, 2024, were collected from WeChat Official Accounts. Based on these captions, bilingual question-answer (QA) pairs in Chinese and English were generated using GPT-4o-mini. QA pairs were categorized into six subsets by question type and language: binary (Binary_CN, Binary_EN), single-choice (Single-choice_CN, Single-choice_EN), and open-ended (Open-ended_CN, Open-ended_EN). The benchmark was used to evaluate six VLMs: GPT-4o, Gemini 2.0 Flash, Qwen2.5-VL-72B-Instruct, Janus-Pro-7B, InternVL3-8B, and HealthGPT-L14. The primary outcome was overall accuracy; secondary outcomes included subset-, subspecialty-, and modality-specific accuracy. Performance on open-ended questions was also quantified using language-based metrics, including AlignScore, BARTScore, BERTScore, BLEU, CIDEr, METEOR, and ROUGE-L. Error types in open-ended responses were manually analyzed through stratified sampling.
RESULTS: OphthalWeChat included 3,469 images and 30,120 QA pairs covering 9 ophthalmic subspecialties, 548 conditions, 29 imaging modalities, and 68 modality combinations. Gemini 2.0 Flash achieved the highest overall accuracy (0.555), significantly outperforming GPT-4o (0.527), Qwen2.5-VL-72B-Instruct (0.520), HealthGPT-L14 (0.502), InternVL3-8B (0.453), and Janus-Pro-7B (0.333) (all P < 0.001). It also led in both the Chinese (0.551) and English (0.559) subsets. By subset, Gemini 2.0 Flash excelled in Binary_CN (0.687) and Single-choice_CN (0.666); HealthGPT-L14 performed best in Single-choice_EN (0.739); and GPT-4o ranked highest in Binary_EN (0.717), Open-ended_CN (0.254), and Open-ended_EN (0.271). Language-based metrics yielded rankings inconsistent with accuracy on the open-ended subsets. Performance varied across subspecialties and imaging modalities, with Gemini 2.0 Flash leading in 6 of 9 subspecialties and 11 of the top 15 imaging modalities. Analysis of error types revealed lesion/diagnosis errors as the most frequent (35.6%-50.6%), followed by anatomical location errors (28.3%-37.5%).
CONCLUSIONS: This study presents the first bilingual VQA benchmark for ophthalmology, distinguished by its real-world context and inclusion of multiple examinations per patient. The dataset enables quantitative evaluation of VLMs, supporting the development of accurate and specialized AI systems for eye care.