Abstract
Cranial imaging diagnosis analyzes lesions in cranial CT images, automatically determining information such as lesion type, providing physicians with diagnostic recommendations, and saving valuable time during urgent medical events such as stroke. Most current mainstream diagnostic generation methods rely on single-modal data for analysis, such as imaging reports or a single CT slice. However, relying on a single CT slice leads to the loss of lesion information distributed across different CT slices. In addition, imaging reports often contain only coarse descriptions of abnormal regions without detailed pixel-level features, resulting in insufficient lesion characterization. Moreover, some works directly apply 3D CNNs to entire 3D CT scans, but 3D CNNs alone struggle to accurately identify very small lesions. Therefore, this paper proposes a multimodal diagnostic model called MM-CD (MultiModal Craniocerebral Diagnose), which integrates imaging reports and cranial 3D CT findings for joint diagnosis. Specifically, this study first applies a 2D image pretrained model, combined with a vertical-dimension weight generation module, to the cranial 3D CT images, enabling the model to focus on abnormal CT slices. Next, a multi-scale image fusion module is designed to effectively consolidate lesion descriptions from multiple CT slices into a single slice. Then, through a self-attention mechanism, the CT information is integrated with the imaging report to construct a more comprehensive diagnostic reference. This clinically oriented design aims to lower missed-diagnosis rates for small, spatially sparse lesions and to shorten door-to-treatment intervals in acute care. Experimental results on a real clinical dataset show that the method improves overall accuracy by 1.65% compared to existing state-of-the-art medical multimodal models.
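The vertical-dimension weighting described above can be illustrated with a minimal NumPy sketch. This is a hypothetical simplification, not the paper's implementation: the 2D pretrained encoder is replaced by a stub (`encode_slice`), and the learned weight-generation module is stood in for by a dot product against a learnable query vector followed by a softmax over the slice axis, so that slices resembling the query (here, a proxy for "abnormal") receive higher fusion weights.

```python
import numpy as np

def encode_slice(slice_2d: np.ndarray, feat_dim: int = 8) -> np.ndarray:
    # Stub for a 2D pretrained encoder: project flattened pixels
    # through a fixed random matrix to a feature vector.
    rng = np.random.default_rng(0)  # fixed seed -> deterministic "weights"
    proj = rng.standard_normal((slice_2d.size, feat_dim))
    return slice_2d.ravel() @ proj

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_volume(volume: np.ndarray, query: np.ndarray):
    """Weight slices along the vertical (depth) axis and fuse.

    volume: (D, H, W) CT volume; query: (feat_dim,) hypothetical
    learned vector indicating what an "abnormal" slice looks like.
    Returns (fused_feature, per-slice weights summing to 1).
    """
    feats = np.stack([encode_slice(s) for s in volume])  # (D, feat_dim)
    scores = feats @ query                               # (D,)
    weights = softmax(scores)                            # attention over depth
    fused = weights @ feats                              # (feat_dim,) single fused "slice"
    return fused, weights

# Toy usage: a 4-slice, 16x16 volume fused into one feature vector.
volume = np.random.default_rng(1).standard_normal((4, 16, 16))
query = np.ones(8)
fused, weights = fuse_volume(volume, query)
```

In the full model the query and encoder would be learned end to end, and the fused representation would then be combined with the report embedding via self-attention; this sketch only shows the depth-weighted pooling step.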