Abstract
Early and reliable detection of breast cancer across imaging modalities remains a long-standing challenge due to the heterogeneous appearance of lesions and the lack of cross-domain consistency among medical imaging systems. Recent advances in Vision Transformers (ViTs) and parameter-efficient fine-tuning (PEFT) techniques have enabled rapid model adaptation, yet most existing approaches remain purely data-driven and fail to incorporate domain-specific anatomical priors. In this work, we propose A-VPT (Anatomy-Guided Visual Prompt Tuning), a novel framework that integrates explicit anatomical structure into the prompt space of a frozen ViT backbone. Unlike conventional prompt tuning methods, A-VPT dynamically generates tissue-aware prompts guided by glandular, fatty, and ductal region embeddings, and performs hierarchical prompt-token interaction across transformer layers. Furthermore, a cross-modal contrastive alignment strategy harmonizes anatomical semantics across mammography, ultrasound, and MRI, enabling robust multi-domain generalization. Extensive experiments on three benchmark datasets (INbreast, BUSI, and Duke-Breast-MRI) demonstrate that A-VPT achieves state-of-the-art performance in both lesion classification and segmentation while updating less than 2% of the trainable parameters used in full fine-tuning. Qualitative analyses confirm that anatomy-guided prompts yield interpretable attention patterns consistent with radiological structures. Our results suggest that embedding anatomical priors into prompt tuning not only enhances efficiency and generalization but also provides an interpretable bridge between deep learning representations and human anatomical reasoning.