Abstract
Skin cancer is among the most prevalent and deadly cancers worldwide, and early diagnosis is vital to improving patient survival. Deep learning has shown great potential for automatic skin lesion classification. However, existing Convolutional Neural Networks (CNNs) remain limited by their reliance on large datasets, their sensitivity to lesion orientation, and their inability to model long-range global context. To address these limitations, we propose a hybrid model, VGG19-RSPDA-ViT, that combines the fine-grained local features captured by VGG19 with the global context modeled by a Vision Transformer (ViT). The proposed RSPDA module enforces rotation invariance and enriches the feature space, further strengthening generalization on small training sets. To the best of our knowledge, this is the first work to systematically combine feature-map-level rotation/shift augmentation with a CNN-Transformer hybrid model for dermoscopic skin cancer detection. Performance was validated on three benchmark datasets: the Melanoma Skin Cancer Dataset of 10,000 Images (MSK10000) for binary classification, and the Human Against Machine with 10,000 training images (HAM10000) and Hospital Pedro Hispano (PH2) datasets for multi-class classification. Our model achieved accuracies of 97.9%, 97.1%, and 98.67% on the MSK10000, HAM10000, and PH2 datasets, respectively, with consistently high macro-averaged precision, recall, specificity, and F1 scores across all three datasets. VGG19-RSPDA-ViT outperformed existing state-of-the-art methods while exhibiting superior generalization. These results demonstrate that the proposed model is effective for skin lesion classification and holds significant potential for clinical application as an automated diagnostic tool in dermatology.