Abstract
Background/Objectives: Monkeypox is a zoonotic viral disease that presents with smallpox-like skin lesions, making visual diagnosis challenging due to overlap with other dermatological conditions. Existing AI-based studies on monkeypox classification have relied largely on Convolutional Neural Networks (CNNs), with limited exploration of Transformer architectures or robust interpretability frameworks. Moreover, most explainability research still depends on conventional heatmap techniques without systematic evaluation. This study addresses these gaps by applying Transformer-based models and introducing a novel hybrid explainability approach.

Methods: We fine-tuned Vision Transformer (ViT) and Data-Efficient Image Transformer (DeiT) models for both binary and multi-class classification of monkeypox and other skin lesions. To improve interpretability, we integrated multiple explainable AI techniques, namely Gradient-weighted Class Activation Mapping (Grad-CAM), Layer-wise Relevance Propagation (LRP), and Attention Rollout (AR), and proposed a hybrid method that combines their heatmaps using Principal Component Analysis (PCA). The reliability of the resulting explanations was quantitatively assessed with deletion and insertion metrics.

Results: ViT outperformed DeiT, achieving an AUC of 0.9192 in binary classification and 0.9784 in the multi-class task. The hybrid approach (Grad-CAM + LRP) produced the most informative explanations, achieving higher insertion scores and lower deletion scores than any individual method, thereby enhancing clinical reliability.

Conclusions: This study is among the first to combine Transformer models with systematically evaluated hybrid explainability techniques for monkeypox classification. By improving both predictive performance and interpretability, our framework contributes to more transparent and clinically relevant AI applications in dermatology. Future work should expand the datasets and integrate clinical metadata to further improve generalizability.
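The PCA-based fusion of explanation heatmaps mentioned in the Methods can be sketched as follows. This is a minimal illustration, not the authors' exact implementation: the function name `pca_fuse_heatmaps`, the choice of taking the first principal component of the stacked, flattened maps, and the sign/range normalization are all assumptions.

```python
import numpy as np

def pca_fuse_heatmaps(maps):
    """Fuse several (H, W) saliency maps (e.g. Grad-CAM and LRP) into one
    map by projecting each pixel's vector of scores onto the first
    principal component. Illustrative sketch, not the paper's code."""
    H, W = maps[0].shape
    # Rows are pixels, columns are the individual explanation methods.
    X = np.stack([m.ravel() for m in maps], axis=1)   # shape (H*W, n_maps)
    X = X - X.mean(axis=0)                            # center per method
    # First principal component via SVD of the centered matrix.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    fused = (X @ Vt[0]).reshape(H, W)
    # The PC sign is arbitrary; orient so high values mean high relevance.
    if fused.max() < -fused.min():
        fused = -fused
    # Rescale to [0, 1] for display alongside the input image.
    fused = (fused - fused.min()) / (fused.max() - fused.min() + 1e-8)
    return fused
```

In this formulation the first principal component captures the direction of greatest agreement across the individual heatmaps, which is one plausible reading of "combining heatmaps using PCA."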
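The deletion and insertion metrics used to score the explanations can be sketched as below. Assumptions not taken from the abstract: the function name `saliency_auc`, a zero-valued baseline for removed pixels, the step count, and `model` being any callable that maps one image to a target-class probability.

```python
import numpy as np

def saliency_auc(model, image, saliency, steps=20, mode="deletion"):
    """Deletion/insertion metric: rank pixels by saliency, then
    progressively erase (deletion) or reveal (insertion) them and
    integrate the model's score curve. Higher insertion AUC and lower
    deletion AUC indicate a more faithful heatmap."""
    H, W = saliency.shape
    order = np.argsort(saliency.ravel())[::-1]            # most salient first
    if mode == "deletion":
        start, fill = image.copy(), np.zeros_like(image)  # erase salient pixels
    else:
        start, fill = np.zeros_like(image), image.copy()  # reveal salient pixels
    n = H * W
    scores = []
    for k in range(steps + 1):
        cut = int(round(n * k / steps))
        img = start.copy()
        idx = np.unravel_index(order[:cut], (H, W))
        img[idx] = fill[idx]
        scores.append(model(img))
    s = np.asarray(scores, dtype=float)
    # Trapezoidal area under the score curve; x-axis normalized to [0, 1].
    return float((s[:-1] + s[1:]).sum() / (2 * steps))
```

With a faithful heatmap, deleting the most salient pixels first should collapse the score quickly (low deletion AUC), while inserting them first should recover it quickly (high insertion AUC), which is the comparison the Results report for the hybrid Grad-CAM + LRP maps.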