Abstract
Background/Objectives: Multimodal data fusion is increasingly applied in neuroinformatics to integrate heterogeneous sources of information. However, optimal strategies for combining modalities with markedly different dimensionality, scale, and noise characteristics remain unclear. To our knowledge, this is among the first systematic, controlled benchmarks to explicitly disentangle the effects of fusion strategy and feature scaling within a unified deep learning framework. Methods: Using data from 747 healthy participants in the Human Connectome Project, we evaluated multiple fusion paradigms, including early, attention-based, subspace-based, and graph-based fusion, within a unified and reproducible framework. Importantly, we assessed how different feature scaling techniques (Standard, Min-Max, and Robust scaling) interact with fusion strategies and influence model performance. Biological sex classification served as a controlled benchmark task, focusing the study on methodological insights rather than task-specific optimization. Results: Early feature-level fusion consistently achieved the highest classification performance across all evaluated configurations. In particular, direct concatenation of cognitive and neuroimaging features combined with Standard Scaling yielded the best results (AUC-ROC = 0.96 [0.95-0.96]), outperforming unimodal baselines as well as intermediate and late fusion strategies. Conclusions: This systematic benchmark demonstrates that multimodal deep learning performance in neuroscience is driven primarily by the interaction between fusion strategy and feature scaling rather than by architectural complexity alone. By explicitly disentangling the effects of fusion level and preprocessing within a unified framework, this study offers practical methodological guidance for the design, evaluation, and reproducible deployment of multimodal deep learning models in neuroscience.
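To make the winning configuration concrete, the sketch below illustrates early (feature-level) fusion with per-modality Standard Scaling on synthetic stand-in data. The feature dimensions, the logistic-regression classifier, and all variable names are illustrative assumptions, not the study's exact pipeline or deep learning architecture.

```python
# Minimal sketch of early fusion with Standard Scaling.
# Data are synthetic stand-ins; the classifier and feature
# dimensions are assumptions, not the paper's implementation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 747                              # participants, as in the study
X_cog = rng.normal(size=(n, 50))     # cognitive features (dimension assumed)
X_img = rng.normal(size=(n, 400))    # neuroimaging features (dimension assumed)
y = rng.integers(0, 2, size=n)       # biological sex labels (synthetic)

X_tr_c, X_te_c, X_tr_i, X_te_i, y_tr, y_te = train_test_split(
    X_cog, X_img, y, test_size=0.2, stratify=y, random_state=0
)

# Scale each modality independently, fitting on training data only,
# then fuse by direct concatenation (early feature-level fusion).
sc_c = StandardScaler().fit(X_tr_c)
sc_i = StandardScaler().fit(X_tr_i)
X_tr = np.concatenate([sc_c.transform(X_tr_c), sc_i.transform(X_tr_i)], axis=1)
X_te = np.concatenate([sc_c.transform(X_te_c), sc_i.transform(X_te_i)], axis=1)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# With random synthetic labels this will hover near chance (~0.5);
# the point is the fusion/scaling pattern, not the score.
print("AUC-ROC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```

Note that the scalers are fit on the training split only and then applied to the test split, which avoids the leakage that would otherwise inflate performance estimates in any fusion-versus-scaling comparison.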