Abstract
BACKGROUND: Breast cancer remains the most commonly diagnosed malignancy among women worldwide. Histopathological image analysis is the clinical gold standard for diagnosis; however, the high resolution and complexity of these images, together with limited annotated data, pose significant challenges for traditional deep learning methods. This study aims to develop a robust classification framework for high-resolution histopathological images.
METHODS: We propose ResViT-GANNet, a dual-branch deep learning architecture that integrates a residual convolutional network with channel attention and a vision transformer with multi-layer token fusion. The design is intended to capture both fine-grained local pathological features and long-range global semantic representations. A key novelty of the framework is the Token-Aligned Multimodal Attention (TAMA) module, which combines the heterogeneous features from the two branches through multi-head attention and token-wise alignment. To address limited and imbalanced data, we added synthetic histopathology images generated with StyleGAN2-ADA to the training set. We evaluated the framework on the BACH and BreakHis datasets and tested its differences from baseline methods for statistical significance.
RESULTS: On the BACH dataset (4-class classification), ResViT-GANNet achieved an accuracy of 96.40%, precision of 96.34%, recall of 96.36%, and an F1-score of 96.35%. It significantly outperformed baseline methods including TransMIL (85.83%), CTransPath (88.75%), and SwinCNN (92.89%), with p-values < 0.01 and large effect sizes (Cohen’s d > 1.0). Incorporating synthetic data yielded an average accuracy improvement of 3.3%. On the BreakHis dataset (8-class classification across four magnification levels), the model attained an average accuracy of 98.22%, with per-class accuracies ranging from 97.25% to 99.50%. Grad-CAM visualizations highlighted histological features relevant to classification, supporting the interpretability of the model.
CONCLUSIONS: ResViT-GANNet substantially improves classification performance on complex, high-resolution histopathology images. The major contributions of this work are a parallel dual-branch architecture enabling synergistic local–global feature learning, a token-aligned multimodal fusion mechanism, and the integration of generative augmentation with explainable AI. Together, these contributions improve model generalization and robustness, underscoring the potential of ResViT-GANNet as a clinically useful decision-support system for breast cancer diagnosis.
TRIAL REGISTRATION: Not applicable.
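The effect-size criterion quoted in the RESULTS (Cohen’s d > 1.0) can be illustrated with a minimal sketch. The abstract does not say which variant of Cohen’s d was used or which samples were compared, so the code below assumes the common pooled-standard-deviation form for two independent samples. The function name `cohens_d` and the score lists are hypothetical and invented for illustration; they are not data from this study.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(sample_a, sample_b):
    """Cohen's d for two independent samples, using the pooled standard deviation.

    Assumption: this is the textbook pooled-SD form; the study may have used
    a different variant.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two values")
    # statistics.variance returns the sample variance (n - 1 denominator).
    pooled_var = (
        (n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; d is undefined")
    return (mean(sample_a) - mean(sample_b)) / sqrt(pooled_var)


# Hypothetical per-run scores for two models (illustrative values only).
scores_a = [0.95, 0.96, 0.97]
scores_b = [0.91, 0.93, 0.92]
d = cohens_d(scores_a, scores_b)
print(f"Cohen's d = {d:.2f}")
```

By the usual convention, values of d above about 0.8 are described as large, so a threshold of 1.0 corresponds to a mean difference of at least one pooled standard deviation.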