Abstract
BACKGROUND: Pretrained foundation models are increasingly adopted for diabetic retinopathy (DR) screening, yet it remains unclear how much of their performance derives from the learned representations versus the adaptation procedure. Most benchmarks report discrimination metrics alone, neglecting probability calibration.

METHODS: We compared the frozen representations of three pretrained encoders: MedSigLIP (medical vision–language; ViT-B/16, 448 × 448), RETFound (retinal self-supervised; ViT-L/16, 224 × 224), and EfficientNet-B0 (ImageNet-supervised; 224 × 224). All encoder weights were frozen; only an identical lightweight multilayer perceptron head was trained. Models were developed on APTOS 2019 (3,662 fundus images; five-fold cross-validation) and externally validated on MESSIDOR-2 (1,744 images). Binary referable DR detection and five-class severity grading were evaluated. AUC, expected calibration error (ECE), and Brier score served as co-primary endpoints. External-set tests used a patient-level cluster-robust bootstrap to account for bilateral correlation.

RESULTS: On the development set, all three encoders achieved near-identical binary AUC (0.980–0.985). MedSigLIP showed superior calibration, with a lower Brier score than RETFound (0.044 vs. 0.049; p = 0.030) and EfficientNet-B0 (0.044 vs. 0.052; p = 0.006). External validation on MESSIDOR-2 revealed divergence: MedSigLIP maintained an AUC of 0.915 (drop 0.070), whereas RETFound fell to 0.697 (drop 0.286) and EfficientNet-B0 to 0.745 (drop 0.236). Retina-specific RETFound performed below the ImageNet baseline (ΔAUC = −0.051; p = 0.016, cluster-robust bootstrap). For five-class grading, MedSigLIP achieved an external macro-F1 of 0.450 versus 0.247 (RETFound) and 0.291 (EfficientNet-B0). Temperature scaling reduced development ECE to 0.014–0.022 but proved ineffective under domain shift (external ECE 0.086–0.149).
All encoders exhibited catastrophic failure on mild DR (grade 1) externally, with RETFound and EfficientNet-B0 achieving F1 = 0.000 and MedSigLIP reaching only 0.153.

CONCLUSION: Under frozen transfer, the MedSigLIP encoder produced more generalisable and better-calibrated representations than both the retinal self-supervised (RETFound) and ImageNet-supervised (EfficientNet-B0) encoders. Domain-specific pretraining did not guarantee domain-general frozen representations. These findings demonstrate that development-set discrimination alone is insufficient for encoder evaluation and that calibration metrics—particularly the Brier score—should be reported as standard practice.
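The calibration endpoints used above (Brier score, ECE) and the temperature-scaling post-hoc correction can be sketched as follows. This is a minimal illustration of the standard definitions, not the authors' implementation; the function names, the 10-bin equal-width ECE binning, and the binary (sigmoid) setting are assumptions.

```python
import numpy as np

def brier_score(y_true, p):
    # Binary Brier score: mean squared difference between the predicted
    # probability and the observed outcome (0 or 1). Lower is better.
    return float(np.mean((p - y_true) ** 2))

def expected_calibration_error(y_true, p, n_bins=10):
    # ECE with equal-width confidence bins: weighted average of
    # |observed event rate - mean predicted probability| per bin.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p > lo) & (p <= hi)
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - p[mask].mean())
    return float(ece)

def temperature_scale(logits, T):
    # Post-hoc temperature scaling (binary case): divide logits by a scalar
    # temperature T (fit on held-out development data) before the sigmoid.
    # T > 1 softens overconfident predictions; T = 1 leaves them unchanged.
    return 1.0 / (1.0 + np.exp(-logits / T))
```

For example, predictions p = [0.9, 0.1] against outcomes y = [1, 0] give a Brier score of 0.01 and a 10-bin ECE of 0.10, since each prediction is 0.1 away from its observed outcome. Temperature scaling fit in-domain cannot repair the miscalibration the abstract reports under domain shift, because a single scalar T only rescales confidence globally.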