Abstract
Background/Objective: The global spread of Monkeypox (Mpox) has highlighted the urgent need for rapid, accurate diagnostic tools. Traditional methods such as polymerase chain reaction (PCR) are resource-intensive, whereas skin image-based detection offers a promising alternative. This study evaluates the effectiveness of vision transformers (ViTs) for automated Mpox detection. Methods: A pre-trained ViT model was fine-tuned on an Mpox lesion image dataset to create a robust ViT-based transfer learning (TL) model. Performance was assessed relative to convolutional neural network (CNN)-based TL models and ViT models trained from scratch across key metrics: accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC). Furthermore, a transferability measure was used to assess how effectively pre-trained features transfer to Mpox images. Results: The ViT model outperformed the CNN-based models, achieving an AUC of 0.948 and an accuracy of 0.942, with p < 0.05 across all metrics, highlighting its potential for accurate and scalable Mpox detection. Moreover, the ViT models yielded a better hypothesis margin-based transferability measure, indicating their effectiveness in transferring useful learned weights to Mpox images. Gradient-weighted Class Activation Mapping (Grad-CAM) visualizations also confirmed that the ViT model attends to clinically relevant features, supporting its interpretability and reliability for diagnostic use. Conclusions: These results suggest that ViTs offer superior accuracy, making them a valuable tool for early Mpox detection in field settings, especially where conventional diagnostics are limited. This approach could support faster outbreak response and improved resource allocation in public health systems.