Abstract
Advances in artificial intelligence (AI) have significantly improved medical diagnosis, with deep learning models achieving expert-level performance across unimodal tasks such as medical imaging, physiological signal analysis, electronic health record (EHR) modeling, and omics-based prediction. However, clinical decision-making is inherently multimodal, as diseases manifest through complex interactions among imaging phenotypes, molecular signatures, physiological measurements, and textual clinical documentation. Consequently, unimodal systems often lack robustness, generalizability, and clinical reliability. This survey provides a comprehensive and methodologically grounded review of multimodal learning for disease diagnosis, emphasizing the paradigm shifts that have emerged over the past five years. Beyond classical early, intermediate, and late fusion strategies, we synthesize modern cross-modal representation learning frameworks, including contrastive alignment, vision-language pretraining, graph- and hypergraph-based multimodal reasoning, modality-agnostic representation learning, and missing-modality-robust architectures. We further examine large-scale, foundation-model-style multimodal pretraining and recent advances in histology-transcriptomics and image-omics integration, which exemplify biologically grounded cross-modal learning beyond traditional fusion pipelines. In addition to summarizing widely used datasets and clinical applications across oncology, neurology, cardiology, pulmonology, and ophthalmology, we provide a methodological synthesis that links key challenges, such as modality heterogeneity, incomplete data, fairness disparities, interpretability limitations, and cross-institutional distribution shift, to representative solution frameworks proposed in the literature.
By integrating theoretical formulations, architectural insights, and application-driven evidence, this survey moves beyond case-oriented performance comparisons and offers a structured perspective on how multimodal AI is evolving toward scalable, robust, and clinically trustworthy diagnostic systems.