Abstract
BACKGROUND: Neoadjuvant chemotherapy (NAC) is a standard treatment for breast cancer, yet only some patients derive significant benefit, so identifying those most likely to respond is crucial. Because single-modality data often overlook patient heterogeneity, we developed MuFi, an interpretable, attention-based multimodal full-information feature-fusion transformer that predicts NAC response by integrating whole slide images (WSIs) and magnetic resonance imaging (MRI).

METHODS: Data from 567 biopsy-confirmed breast cancer patients at two institutions were retrospectively analyzed and split into a training cohort (n = 290), a validation cohort (n = 73), and an external test cohort (n = 204). Multimodal data comprised pre-treatment pathology slides, MRI scans, and clinical information. A memory-efficient multimodal model fused WSIs and MRI, with a transformer capturing interactions between histological patches and MRI features.

RESULTS: MuFi achieved AUCs of 81.9% and 78.5% in the training and validation cohorts and 79.3% in external testing, outperforming clinical, single-modality, and late-fusion models. Integrating clinical data (cT stage and molecular subtype) with the MuFi and Feature Re-calibration based Multiple Instance Learning (FRMIL) models further increased AUCs to 90.2%, 81.8%, and 81.6% across the three cohorts, indicating enhanced predictive accuracy and generalizability, especially in external testing.

CONCLUSION: By fusing pathology and radiology features, MuFi improves decision reliability and identifies critical multimodal predictors. This integration framework better captures patient heterogeneity, supporting personalized NAC decision-making through improved accuracy and generalizability.