Abstract
OBJECTIVE: This study aims to evaluate the value of digital breast tomosynthesis (DBT) habitat imaging, combined with various machine learning algorithms, for distinguishing benign from malignant breast architectural distortion.

METHODS: This retrospective study included 254 architectural distortion lesions from two medical centers, collected between January 2019 and July 2023. Data from the first center were divided into training and validation sets at a ratio of 7:3; data from the second center served as an external test set. Breast DBT scans of the patients were collected. The lesions were delineated slice by slice using ITK-SNAP software, and radiomics features were extracted with PyRadiomics. Z-score normalization was then applied so that the features had comparable scales and variances. The Bayesian Information Criterion (BIC) was used to determine the optimal number of clusters, and clustering with a Gaussian Mixture Model (GMM) then generated the tumor sub-regions (habitats). Features were extracted from each habitat sub-region to obtain habitat imaging features. These habitat features were standardized, and dimensionality reduction was performed on the training set using hypothesis testing followed by the Least Absolute Shrinkage and Selection Operator (LASSO) to obtain the optimal feature subset. Finally, various machine learning algorithms were used to construct radiomics models, which were validated on the internal validation set and the external test set. Models were evaluated using receiver operating characteristic (ROC) curves and confusion matrices.

RESULTS: After sample allocation, the training set comprised 112 cases, the internal validation set 47, and the external test set 95. A total of 2,260 habitat imaging features were extracted. Hypothesis testing and LASSO reduced these to 19 optimal features, which were used to construct the machine learning models. Among the compared models, logistic regression performed best, with area under the curve (AUC) values of 0.868, 0.739, and 0.665 in the training set, internal validation set, and external test set, respectively.

CONCLUSION: This study demonstrates that DBT-based habitat imaging shows promise for distinguishing benign from malignant breast architectural distortion.

SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12880-025-01987-5.
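The habitat-generation step described in METHODS (Z-score normalization, BIC-based choice of the cluster count, then GMM clustering) can be illustrated with a minimal sketch. This is not the study's code: the use of scikit-learn, the helper name `cluster_habitats`, the candidate range of 2 to 6 clusters, the full covariance type, and the synthetic per-voxel feature matrix are all assumptions made for illustration only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler


def cluster_habitats(voxel_features, candidate_k=range(2, 7), seed=0):
    """Hypothetical helper: Z-score features, choose k by BIC, fit a GMM.

    voxel_features: array of shape (n_voxels, n_features) for one lesion.
    Returns the selected cluster count, one habitat label per voxel,
    and the BIC value for every candidate cluster count.
    """
    # Z-score normalization: zero mean and unit variance per feature.
    scaled = StandardScaler().fit_transform(voxel_features)

    bic_by_k = {}
    models = {}
    for k in candidate_k:
        gmm = GaussianMixture(
            n_components=k,
            covariance_type="full",  # assumption; not stated in the abstract
            n_init=3,
            random_state=seed,
        )
        gmm.fit(scaled)
        bic_by_k[k] = gmm.bic(scaled)  # lower BIC is better
        models[k] = gmm

    # Select the cluster count with the lowest BIC and assign habitats.
    best_k = min(bic_by_k, key=bic_by_k.get)
    labels = models[best_k].predict(scaled)
    return best_k, labels, bic_by_k


# Synthetic stand-in for per-voxel features: three well-separated groups.
rng = np.random.default_rng(0)
centers = np.array(
    [[0, 0, 0, 0], [6, 6, 6, 6], [-6, 6, -6, 6]], dtype=float
)
voxel_features = np.vstack(
    [center + rng.normal(size=(300, 4)) for center in centers]
)

best_k, habitat_labels, bic_by_k = cluster_habitats(voxel_features)
print("selected number of habitats:", best_k)
print("voxels per habitat:", np.bincount(habitat_labels))
```

In a real pipeline, the labels would be mapped back onto the lesion mask so that radiomics features can be extracted separately from each habitat sub-region; that step depends on the image geometry and is not shown here.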