Abstract
BACKGROUND: Cardiovascular diseases remain the leading global cause of mortality, yet traditional electrocardiogram (ECG) interpretation is subject to interobserver variability and has limited sensitivity to complex pathologies.
OBJECTIVE: This study proposes the Cardiovascular Multimodal Prediction Network (CaMPNet), a transformer-based multimodal architecture that integrates raw 12-lead ECG waveforms, 9 structured machine-measured ECG features, and demographic data (age and sex) through cross-attention fusion.
METHODS: The model was trained on 384,877 records from the Medical Information Mart for Intensive Care IV - Electrocardiogram Matched Subset database and evaluated across 12 cardiovascular disease labels. To assess temporal robustness, temporal external validation was performed on the most recent 10% of the data, which was withheld chronologically from model development.
RESULTS: On the internal test set, the model achieved a mean area under the curve (AUC) of 0.845 (SD 0.04) and an area under the precision-recall curve of 0.489, outperforming all single-modality variants and the residual networks-ECG baseline, which reached a comparable AUC (0.848) but a low F1-score (0.152). Subgroup analyses demonstrated consistent performance across demographics (AUC: male 0.846 vs female 0.843; youngest quartile 0.884 vs oldest quartile 0.811). In temporal external validation, CaMPNet retained moderate discriminative ability, with a mean AUC of 0.715 (SD 0.03) and an area under the precision-recall curve of 0.298; the decline relative to the internal test set was attributed to temporal distribution shifts. Nevertheless, major disease categories, such as atrial fibrillation, heart failure, and normal rhythm, maintained high AUCs (>0.84). Attention-based visualization revealed clinically interpretable patterns (eg, ST-segment elevations in ST-segment elevation myocardial infarction), and ablation experiments verified the model's tolerance to missing structured inputs.
CONCLUSIONS: CaMPNet demonstrates robust and interpretable multimodal ECG-based diagnosis, offering a scalable framework for comorbidity screening and continual learning under real-world temporal dynamics.
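The temporal external validation described in METHODS can be illustrated with a minimal sketch of a chronological holdout, in which the most recent fraction of records is set aside before any model development. This is not the authors' code: the function name `chronological_holdout`, the field name `ecg_time`, and the toy records are hypothetical, and details such as tie handling or patient-level grouping are not specified in the abstract.

```python
from datetime import datetime


def chronological_holdout(records, time_key="ecg_time", holdout_fraction=0.10):
    """Split records into (development, temporal_holdout) by timestamp.

    The newest `holdout_fraction` of records forms the holdout set.
    `time_key` is a hypothetical field name used for illustration only.
    """
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be strictly between 0 and 1")
    if not records:
        return [], []
    # Sort oldest to newest so that the tail of the list is the most recent data.
    ordered = sorted(records, key=lambda r: r[time_key])
    n_holdout = max(1, round(len(ordered) * holdout_fraction))
    cut = len(ordered) - n_holdout
    return ordered[:cut], ordered[cut:]


# Toy data (fabricated for illustration): one record per year.
records = [{"id": i, "ecg_time": datetime(2010 + i, 1, 1)} for i in range(20)]
development, temporal_holdout = chronological_holdout(records)
```

Because every holdout record is at least as recent as every development record, the holdout set probes how a model behaves under the kind of temporal distribution shift the abstract reports.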