Abstract
Background/Objectives: Colorectal cancer (CRC) is a leading cause of cancer deaths worldwide, underscoring the need for diagnostic tools that are early, accurate, and clinically interpretable. Current artificial intelligence (AI) models are predominantly unimodal and lack sufficient interpretability, which restricts their clinical adoption. Methods: We propose IDF-Net, an interpretable dynamic fusion framework that integrates endoscopy, computed tomography (CT), and histopathology using modality-specific encoders, a dual-stage adaptive gating mechanism, and cross-modal attention. We conducted stratified 5-fold cross-validation and assessed interpretability using spatial heatmaps and modality attribution, quantifying saliency alignment with the intersection-over-union metric. Results: IDF-Net achieved a state-of-the-art accuracy of 0.920 (95% CI: 0.907-0.936) and area under the curve (AUC) of 0.991 (95% CI: 0.965-0.997), significantly outperforming unimodal and static-fusion baselines (p < 0.05). Interpretability analysis demonstrated strong alignment between Gradient-weighted Class Activation Mapping++ heatmaps and expert-annotated lesions, as well as case-specific modality contributions via SHapley Additive exPlanations values. Ablation studies confirmed the contribution of each component, with dynamic routing and cross-attention fusion improving AUC by 0.038 and 0.046, respectively. Conclusions: IDF-Net introduces a dynamically fused, multimodal diagnostic framework with integrated quantitative interpretability, demonstrating superior accuracy and strong potential for clinical translation in CRC diagnosis. The model's adaptive design allows it to function robustly even when CT data are unavailable, aligning with common clinical pathways while leveraging additional imaging, when present, for comprehensive staging.