Abstract
BACKGROUND: Lung cancer is a highly lethal malignancy, and timely, accurate differentiation between benign and malignant pulmonary nodules (PNs) is essential for early intervention and improved prognosis. Positron emission tomography/computed tomography (PET/CT) is a multimodal imaging technique that integrates metabolic information with anatomical detail and plays a crucial role in tumor diagnosis. This study aimed to develop a multimodal fusion-based classification model for the automated diagnosis of PNs to support clinical decision-making.

METHODS: We propose a novel multi-level cross-modal fusion classification framework whose core architecture comprises (I) a dual-path densely connected network that hierarchically extracts modality-specific features and (II) a multi-level cross-modal interaction mechanism that fuses complementary features across modalities. This end-to-end framework classifies PNs as benign or malignant.

RESULTS: Evaluated on a real-world clinical dataset, the proposed model achieved an accuracy of 0.7778, a precision of 0.7590, a recall of 0.7968, and an F1 score of 0.7725.

CONCLUSIONS: The proposed model outperformed state-of-the-art baselines, supporting the effectiveness of its feature extraction and multi-level cross-modal interaction strategy. These findings highlight its potential as a reliable tool for supporting intelligent, automated diagnosis of PNs in clinical settings.
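
For readers who want to reproduce the kind of figures reported in RESULTS, the following minimal sketch shows how accuracy, precision, recall, and F1 are conventionally computed for a binary benign/malignant task. It is an illustration under stated assumptions, not the authors' evaluation code: the function names (`binary_metrics`, `f1_from_precision_recall`), the label encoding (1 = malignant, 0 = benign), and the toy labels are all hypothetical, and the abstract does not say how the reported scores were averaged.

```python
from typing import Dict, Sequence


def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2.0 * precision * recall / total if total > 0 else 0.0


def binary_metrics(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for one positive class.

    Assumption: `positive` marks the malignant class; this encoding is
    hypothetical and not taken from the study.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1_from_precision_recall(precision, recall),
    }


# Toy labels for illustration only; these are NOT data from the study.
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 0]
toy_metrics = binary_metrics(toy_true, toy_pred)
print(toy_metrics)
```

The harmonic mean of the reported precision and recall, `f1_from_precision_recall(0.7590, 0.7968)`, is approximately 0.7774 rather than the reported 0.7725. One possible explanation is that the reported scores were averaged per class or per fold before being summarized, but the abstract does not specify the averaging scheme, so this remains an assumption.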