Abstract
PURPOSE: Vision-language models (VLMs) are increasingly used to interpret multimodal educational materials, yet their reliability on diagram-, equation-, and text-dense scientific lecture slides remains poorly understood. This work introduces Medical Imaging Lecture Understanding (MILU), a large-scale benchmark designed to characterize cross-model variability in structured understanding of real medical imaging lectures.

APPROACH: MILU comprises 23 lecture sets with 1,117 slides. LLaVA-OneVision, InternVL3-14B, Qwen2-VL-7B, and Qwen3-VL-4B were evaluated with unified prompts that elicit structured JSON. We assessed parsing coverage, pairwise agreement, lecture-level patterns, and how each model's outputs aligned with a simple consensus ensemble that identifies concepts and relations shared across slides and models.

RESULTS: All models produced valid JSON for most slides (92% to 99% coverage), but semantic agreement was extremely low. Pairwise concept Jaccard indices ranged from 0.03 to 0.09, and triple-level F1 scores from 0.001 to 0.033. Lecture-level analysis revealed higher stability in mathematically structured lectures and lower stability in diagram-heavy content. The consensus ensemble showed modest alignment with individual models (concept Jaccard 0.056 to 0.179; triple F1 0.014 to 0.044), exposing areas of consistent convergence while also highlighting systematic disagreement.

CONCLUSIONS: MILU provides the first comprehensive benchmark for evaluating structured understanding of scientific lecture slides. The results show that current VLMs achieve high formatting reliability but low semantic consistency. MILU establishes a foundation for future expert-annotated benchmarks, diagram- and math-aware modeling, and improved methods for scientific lecture interpretation.
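For reference, the agreement metrics named above are taken here to follow their standard definitions; the exact normalization used in MILU is not spelled out in this abstract, so the formulas below are an assumption. For concept sets $A$ and $B$ produced by two models for the same slide, the concept Jaccard index is
\[
J(A, B) = \frac{|A \cap B|}{|A \cup B|},
\]
and for triple sets $T_1$ (treated as prediction) and $T_2$ (treated as reference), the triple-level F1 is
\[
P = \frac{|T_1 \cap T_2|}{|T_1|}, \qquad R = \frac{|T_1 \cap T_2|}{|T_2|}, \qquad F_1 = \frac{2PR}{P + R}.
\]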