Abstract
Vision-Language Models (VLMs) have shown remarkable performance on natural images and text. However, owing to the anatomical homology across patients, the high dimensionality of grayscale images, and heavily imbalanced datasets, traditional VLMs do not adapt well to radiological applications. In this work, we empirically adapt image encoders trained within domain-specific VLMs to two downstream tasks in 2D mammogram image analysis: tissue density estimation and BI-RADS prediction. We study their transfer learning behavior under linear probing, fine-tuning, and online self-distillation. We find that knowledge-driven, domain-specific VLM backbones with frozen weights outperform the Mammo-CLIP VLM as well as supervised baselines such as ViTs and CNNs, even with only 5% of the training data. We further study the generalization capabilities of these models on two external datasets.