Abstract
Breast cancer remains the most prevalent malignancy in women worldwide, and mammography-based early detection plays a pivotal role in improving survival outcomes. While large vision-language models offer transformative potential for mammogram visual question answering, the absence of standardized evaluation benchmarks hinders fair comparison of their performance in mammogram interpretation. In this study, we address this critical gap through three key contributions: (1) We introduce MammoVQA, a mammogram visual question-answering dataset that unifies 15 public datasets, comprising 131,847 images (421K question-answer pairs) at the image level and 72,518 exams (476K images, 144K question-answer pairs) at the exam level. (2) A systematic evaluation of 12 recent high-performance large vision-language models (6 general, 6 medical) reveals diagnostic performance statistically indistinguishable from random guessing, highlighting their unreliability for mammogram interpretation. (3) Our domain-optimized LLaVA-Mammo achieves an average weighted-accuracy gain of +19.66% over the best-performing of these models in internal validation and +21.21% in external validation.