Abstract
Cross-modal reasoning tasks face persistent challenges, including coarse-grained inference of causal dependencies across modalities, weak robustness to noise, and limited interaction between spatio-temporal features. To address these issues, this article proposes a dynamic, causal-aware multimodal reasoning architecture based on collaborative quantum state evolution: the Causal-aware Dynamic Multimodal Reasoning Network (CDMRNet). The model's innovation lies in a three-stage progressive architecture of dynamic causal discovery, quantum-state fusion, and meta-adaptive reasoning: (1) a causal discovery module based on differentiable directed acyclic graphs (DAGs) dynamically identifies causal structures between modalities, addressing the coarse granularity of dependency modeling; (2) a fusion module inspired by quantum entanglement uses controlled phase gates to strengthen semantic coherence between modalities in Hilbert space, improving robustness to noisy environments; (3) a meta-adaptive inference mechanism enables zero-shot adaptation and augments multi-scale memory to improve the accuracy of spatio-temporal feature interaction. To evaluate its performance, the study conducts extensive experiments on three datasets: Visual Genome, MIMIC-CXR, and nuScenes. CDMRNet achieves 89.7% accuracy on Visual Genome, raises the F1 score to 84.1%, and suffers only a 3.9% performance drop when a modality is absent, significantly outperforming state-of-the-art models. Ablation studies confirm the critical role of each module; in particular, the quantum-state fusion module contributes a QED score of 73.0%, evidencing effective cross-modal entanglement. These results validate that CDMRNet not only strengthens causal reasoning but also improves robustness and generalization in quantum-inspired multimodal systems.
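As background for the fusion stage described above, the controlled phase gate is a standard two-qubit operation that applies a phase e^{iφ} only when both qubits are in state |1⟩, which generically entangles a product state. The following minimal NumPy sketch is illustrative only: the two-feature amplitude encoding, the example feature vectors, and the π/4 phase are assumptions for demonstration, not CDMRNet's implementation. It shows how a controlled phase gate entangles two amplitude-encoded modality states, with entanglement confirmed by the purity of the reduced density matrix dropping below 1.

```python
import numpy as np

def controlled_phase(phi):
    # 4x4 controlled-phase gate on two qubits: diag(1, 1, 1, e^{i*phi})
    gate = np.eye(4, dtype=complex)
    gate[3, 3] = np.exp(1j * phi)
    return gate

def amplitude_encode(x):
    # Normalize a real feature vector into a valid quantum amplitude state
    v = np.asarray(x, dtype=complex)
    return v / np.linalg.norm(v)

# Hypothetical 2-dim features for two modalities (illustrative values)
vision = amplitude_encode([0.8, 0.6])
text = amplitude_encode([0.6, 0.8])

# Joint state via tensor product, then entangle with a controlled phase
joint = np.kron(vision, text)               # product (unentangled) state
entangled = controlled_phase(np.pi / 4) @ joint

# Reduced density matrix of the first subsystem: trace out the second qubit
rho = np.outer(entangled, entangled.conj())
rho_a = np.trace(rho.reshape(2, 2, 2, 2), axis1=1, axis2=3)

# Purity Tr(rho_a^2) < 1 indicates the joint state is entangled
purity = np.real(np.trace(rho_a @ rho_a))
```

Because the gate is unitary, the joint state stays normalized; the nonzero relative phase makes the 2x2 amplitude matrix non-rank-1, so the reduced-state purity falls strictly below 1, a simple quantitative proxy for the cross-modal entanglement the QED score measures.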