Abstract
Over the past decade, machine learning has seen dramatic growth in applications, especially in computer vision, where deep convolutional neural networks have achieved, and in some cases exceeded, human-level performance. However, the rise of such black-box models has heightened the demand for transparency. In response, the field of explainable artificial intelligence (XAI) has developed various techniques for explaining model predictions. Saliency maps, in particular, have become popular for explaining predictions on image data. Despite their widespread use, evaluating these techniques remains challenging because explainability is a complex, multi-dimensional property. In this study, we conduct a comprehensive comparative evaluation of six widely used saliency map explainability techniques: LIME, SHAP, GradCAM, GradCAM++, Integrated Gradients (IntGrad), and SmoothGrad. Our evaluation uses five quantitative, functionally grounded metrics (fidelity, stability, identity, separability, and computational time), each addressing a different aspect of interpretability. We apply these metrics across three benchmark datasets and three well-established deep learning architectures to assess the strengths and limitations of each technique. Our empirical analysis shows that no single XAI method excels across all evaluation metrics. Gradient-based methods, especially IntGrad and SmoothGrad, achieved the best fidelity and stability scores, with statistically significant improvements over LIME, GradCAM, and GradCAM++ on CIFAR10 and Imagenette. SHAP also demonstrated strong performance, particularly on the SVHN dataset. All methods except LIME and SmoothGrad achieved perfect identity and separability scores. GradCAM and GradCAM++ offered the highest computational efficiency, though at the cost of lower fidelity. These results highlight trade-offs between explanation quality and computational efficiency, and confirm statistically robust differences in method performance across tasks.