Abstract
Background: This study investigates the impact of anatomically constrained preprocessing and deep learning architecture selection on benign versus malignant breast lesion classification in contrast-enhanced mammography (CEM), with the goal of improving robustness and clinical reliability across heterogeneous data sources. Methods: In this retrospective multicenter study, CEM images from 300 patients (314 lesions) were combined with 1003 publicly available CEM images, yielding a total of 1120 breast cases. Automatic breast segmentation was performed using the LIBRA framework to generate breast-mask images. Eleven deep learning models, including classical convolutional neural networks (CNNs), attention-based networks, hybrid CNNs, Transformer architectures, and mammography-specific models, were trained and evaluated on both original DICOM images and breast-mask inputs. Performance was assessed using accuracy, balanced accuracy, sensitivity, specificity, AUROC, and AUPRC on cross-validation and independent test sets. Hyperparameter optimization was conducted for the best-performing architecture. Results: Models trained on breast-mask images consistently outperformed those trained on original DICOM images across all architectures and metrics, with AUROC improvements ranging from +0.06 to +0.21. Among all models, ResNet50 trained on breast-mask images achieved the best performance (AUROC = 0.931; AUPRC = 0.933; balanced accuracy = 0.834), and hyperparameter optimization improved it further (balanced accuracy = 0.886; sensitivity = 0.842; specificity = 0.930). Classical CNN architectures demonstrated performance comparable to or exceeding that of more complex hybrid CNN–Transformer models when anatomically focused preprocessing and rigorous optimization were applied. Conclusions: Anatomically constrained preprocessing through breast-mask segmentation substantially enhances deep learning performance and stability in CEM-based breast lesion classification. These findings indicate that input representation quality and training optimization are critical determinants of clinically relevant performance, often outweighing architectural complexity, and may support more reliable AI-assisted decision support in CEM workflows.
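As a reading aid for the reported metrics, the sketch below shows how balanced accuracy relates to sensitivity and specificity in a binary benign-versus-malignant setting. It assumes the standard definition (the mean of the two rates); the abstract does not state the formula, and the function name `balanced_accuracy` is illustrative rather than taken from the study's code.

```python
def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Binary balanced accuracy: mean of true-positive and true-negative rates.

    Assumption: the standard two-class definition; not confirmed by the source.
    """
    return (sensitivity + specificity) / 2.0


# Values reported in the abstract for the optimized ResNet50 on breast-mask inputs.
reported_sensitivity = 0.842
reported_specificity = 0.930

result = balanced_accuracy(reported_sensitivity, reported_specificity)
print(f"{result:.3f}")  # prints 0.886
```

Under this definition, the reported sensitivity and specificity reproduce the reported balanced accuracy of 0.886.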