Abstract
BACKGROUND: Deep-learning models designed to assist with clinical decision making abound in cardiology. However, the "black box" nature of these models limits physicians' ability to cross-check model predictions against clinical gestalt. Analytical techniques such as the popular gradient-weighted class activation mapping (Grad-CAM) may provide insight into model explainability, but the reliability and reproducibility of these techniques have not been studied.
OBJECTIVE: To perform a rigorous assessment of the explainability offered by Grad-CAM, with comparison to alternative saliency methods provided by intrinsically explainable deep-learning models.
METHODS: We examined a well-phenotyped cohort of 1930 patients with hypertrophic cardiomyopathy (HCM) and available electrocardiographic waveform data. Novel deep-learning models were developed for the prediction of 2 high-risk HCM features: left ventricular (LV) apical aneurysm and massive LV hypertrophy. Saliency analysis was performed using (1) Grad-CAM and (2) latent-space variable decoding (LSVD).
RESULTS: Deep-learning models amenable to Grad-CAM- and LSVD-based saliency analysis demonstrated comparable performance in the identification of LV apical aneurysm (C statistic 0.95 vs 0.93) and massive LV hypertrophy (C statistic 0.82 vs 0.83) during holdout testing. However, while Grad-CAM produced highly variable visual assessments of model attention and offered little insight into the models' underlying decision-making processes, LSVD allowed direct visualization of the electrocardiographic characteristics that differentiated patients with and without the high-risk HCM features of interest. In addition, Kolmogorov-Smirnov goodness-of-fit testing of latent-space variables offered a method for prospectively assessing the likelihood of deep-learning model overfitting.
CONCLUSION: Deep-learning models amenable to LSVD analysis provided more robust explainability than models amenable to the popular Grad-CAM analytical technique, while achieving comparable predictive performance.
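The abstract's overfitting check relies on Kolmogorov-Smirnov goodness-of-fit testing of latent-space variables. A minimal sketch of that idea is shown below, assuming (as the abstract does not specify) that the latent variables are intended to follow a standard normal distribution, so that a significant departure from normality flags a dimension whose empirical distribution may reflect overfitting; the latent codes here are synthetic placeholders, not the paper's data.

```python
# Illustrative sketch only: the paper's actual LSVD pipeline, latent
# dimensionality, and reference distribution are assumptions here.
# We test each latent dimension against a standard normal with a
# one-sample Kolmogorov-Smirnov goodness-of-fit test.
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)
# Hypothetical latent codes: 500 patients x 8 latent dimensions
latents = rng.normal(size=(500, 8))

for i in range(latents.shape[1]):
    stat, p = kstest(latents[:, i], "norm")  # compare to N(0, 1)
    flag = "possible overfit" if p < 0.05 else "ok"
    print(f"latent dim {i}: KS={stat:.3f}, p={p:.3f} ({flag})")
```

In practice such a screen would be run prospectively on latent codes from a validation cohort; a low p value marks a latent dimension whose distribution has drifted from the assumed prior.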