Benchmarking large multimodal models for ophthalmic visual question answering with OphthalWeChat



Abstract

PURPOSE: To develop a bilingual multimodal visual question answering (VQA) benchmark for evaluating vision-language models (VLMs) in ophthalmology.

METHODS: In this cross-sectional study, ophthalmic image posts and associated captions published between Jan 1, 2016, and Dec 31, 2024, were collected from WeChat Official Accounts. Based on these captions, bilingual question-answer (QA) pairs in Chinese and English were generated using GPT-4o-mini. QA pairs were categorized into six subsets by question type and language: binary (Binary_CN, Binary_EN), single-choice (Single-choice_CN, Single-choice_EN), and open-ended (Open-ended_CN, Open-ended_EN). The benchmark was used to evaluate six VLMs: GPT-4o, Gemini 2.0 Flash, Qwen2.5-VL-72B-Instruct, Janus-Pro-7B, InternVL3-8B, and HealthGPT-L14. The primary outcome was overall accuracy; secondary outcomes included subset-, subspeciality-, and modality-specific accuracy. Performance on open-ended questions was also quantified using language-based metrics, including AlignScore, BARTScore, BERTScore, BLEU, CIDEr, METEOR, and ROUGE_L. Error types in open-ended responses were manually analyzed through stratified sampling.

RESULTS: OphthalWeChat included 3,469 images and 30,120 QA pairs covering 9 ophthalmic subspecialties, 548 conditions, 29 imaging modalities, and 68 modality combinations. Gemini 2.0 Flash achieved the highest overall accuracy (0.555), significantly outperforming GPT-4o (0.527), Qwen2.5-VL-72B-Instruct (0.520), HealthGPT-L14 (0.502), InternVL3-8B (0.453), and Janus-Pro-7B (0.333) (all P < 0.001). It also led in both the Chinese (0.551) and English (0.559) subsets. By subset, Gemini 2.0 Flash excelled in Binary_CN (0.687) and Single-choice_CN (0.666); HealthGPT-L14 performed best in Single-choice_EN (0.739); and GPT-4o ranked highest in Binary_EN (0.717), Open-ended_CN (0.254), and Open-ended_EN (0.271). Language-based metrics showed inconsistent rankings relative to accuracy in the open-ended subsets. Performance varied across subspecialties and modalities, with Gemini 2.0 Flash leading in 6 of 9 subspecialties and 11 of the top 15 imaging modalities. Error-type analysis revealed lesion/diagnosis errors as the most frequent (35.6%-50.6%), followed by anatomical location errors (28.3%-37.5%).

CONCLUSIONS: This study presents the first bilingual VQA benchmark for ophthalmology, distinguished by its real-world context and inclusion of multiple examinations per patient. The dataset enables quantitative evaluation of VLMs, supporting the development of accurate and specialized AI systems for eye care.
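The evaluation combines exact-match accuracy (for binary and single-choice subsets) with language-based overlap metrics (for open-ended answers). As a minimal sketch of two such measures, the following Python snippet implements normalized exact-match accuracy and ROUGE_L, computed here from first principles via the standard longest-common-subsequence F-score; the function names and the example answers are illustrative, not taken from the paper's codebase:

```python
def lcs_len(a, b):
    """Longest-common-subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE_L F1: LCS-based precision/recall over whitespace tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

def accuracy(predictions, answers) -> float:
    """Case- and whitespace-insensitive exact-match accuracy."""
    correct = sum(p.strip().lower() == a.strip().lower()
                  for p, a in zip(predictions, answers))
    return correct / len(answers)

# Illustrative QA pairs (hypothetical, not from OphthalWeChat)
print(accuracy(["Yes", "No"], ["yes", "yes"]))                        # 0.5
print(rouge_l("diabetic retinopathy", "diabetic retinopathy"))        # 1.0
```

The divergence noted in the abstract between accuracy and language-based rankings is expected with such metrics: an open-ended answer can share many tokens with the reference (high ROUGE_L) while naming the wrong lesion or diagnosis.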
