Abstract
Background/Objectives: In this study, a vision transformer (ViT) based ensemble architecture was developed to classify breast ultrasound images as normal, benign, or malignant. The proposed method was implemented on the breast ultrasound images (BUSI) dataset, which includes 133 normal, 437 benign, and 210 malignant ultrasound images. Methods: ROI segmentation and image preprocessing were applied to the dataset so that only the tumor region was passed to the model; restricting the input to the lesion region improved performance. Image augmentation was performed using the Albumentations library to increase the number of images. Features were extracted from the resulting images using three ViT-based models (ViT-Base, DeiT, ViT-Small), which were combined to improve accuracy. The extracted features were classified using a multilayer perceptron (MLP). Training was performed using 10-fold stratified cross-validation, which ensures that each fold contains a proportionate number of images from all three classes. Results: The proposed model achieved 96.2% precision and 86.3% recall for the benign class and 76.4% precision and 92.9% recall for the malignant class. The normal class was classified with 100% success. The area under the curve (AUC) values were 0.97 and 0.96 for the benign and malignant classes, respectively, and 1.00 for the normal class. Conclusions: The ROI-based ViT + MLP + Ensemble architecture provided higher accuracy and explainability than traditional convolutional neural network (CNN) based methods in medical image classification. It performed stably, particularly on the minority classes, and is a potentially reliable and flexible solution for clinical decision support systems.
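To illustrate why stratification matters for a dataset this imbalanced, the following is a minimal pure-Python sketch of a stratified 10-fold split over the BUSI class counts given above. It is not the authors' implementation: the function name `stratified_folds`, the round-robin assignment, and the seed are assumptions made for illustration, and the sketch uses the original image counts rather than the augmented set.

```python
import random
from collections import Counter

# Class counts reported for the BUSI dataset in the abstract.
CLASS_COUNTS = {"normal": 133, "benign": 437, "malignant": 210}


def stratified_folds(labels, n_folds=10, seed=0):
    """Hypothetical helper: assign each sample index to one fold so that
    every fold keeps roughly the class proportions of the full dataset."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of one class round-robin across folds.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


# One label per image, in the order normal, benign, malignant.
labels = [name for name, count in CLASS_COUNTS.items() for _ in range(count)]
folds = stratified_folds(labels, n_folds=10)

for fold_id, test_idx in enumerate(folds):
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    per_class = Counter(labels[i] for i in test_idx)
    print(fold_id, len(train_idx), len(test_idx), dict(per_class))
```

Each held-out fold here contains 13 or 14 normal, 43 or 44 benign, and 21 malignant images, so no fold is left without a minority class. In practice, a library routine such as scikit-learn's `StratifiedKFold` would normally be used instead of a hand-written splitter.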