Abstract
Although deep learning (DL) methods have advanced substantially, skin cancer classification remains challenging. Contributing factors include artifacts such as hair, severe class imbalance in dermatoscopic datasets, and the difficulty of extracting both fine-grained local features (details within a small area of the lesion, such as texture, color, pigment network, and vascular patterns) and long-range global features (such as overall shape, border irregularity, and asymmetry). To address these challenges, this study presents a novel two-stage framework. First, a conditional generative adversarial network (cGAN) generates synthetic images for the minority classes. Second, a CNN-ViT ensemble architecture is introduced, followed by a cross-attention-based fusion module that merges the ViT's global token representations with the CNN's local feature maps. Overall performance is assessed with standard quantitative metrics, while reliability and stability are validated through bootstrap-based statistical analysis. The framework achieved accuracies of 99.3%, 99.7%, 98.9%, and 98.2% on Dermatofibroma, Vascular lesions, Basal Cell Carcinoma, and Actinic Keratosis, respectively, along with an overall AUC of 99.4%, a bootstrap mean of 0.93, and a standard error of 0.0003. The proposed model performed consistently on both majority and minority classes, demonstrating its effectiveness under class imbalance.
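The bootstrap-based validation mentioned above can be illustrated with a minimal sketch. The abstract does not state which metric was resampled or how many resamples were drawn, so the example assumes accuracy and 1000 resamples. The function name `bootstrap_accuracy` and the toy labels are hypothetical and are not taken from the study.

```python
import random
import statistics


def bootstrap_accuracy(y_true, y_pred, n_resamples=1000, seed=0):
    """Return (bootstrap mean, standard error) of accuracy.

    Assumption: accuracy is the resampled metric; the study may use another.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    rng = random.Random(seed)
    # 1 where the prediction matches the label, 0 otherwise
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    n = len(correct)
    scores = []
    for _ in range(n_resamples):
        # Resample test cases with replacement, keeping the original size
        sample = rng.choices(correct, k=n)
        scores.append(sum(sample) / n)
    # The standard error is the standard deviation of the bootstrap distribution
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical labels for illustration only (not data from the study)
y_true = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
y_pred = [0, 1, 2, 3, 0, 1, 2, 0, 0, 1]
mean_acc, std_err = bootstrap_accuracy(y_true, y_pred)
print(f"bootstrap mean accuracy = {mean_acc:.3f}, standard error = {std_err:.3f}")
```

A small standard error indicates that the metric changes little across resampled test sets, which is the sense in which such an analysis supports a stability claim.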