Abstract
Early diagnosis of glottic carcinoma is crucial for improving patients' therapeutic outcomes. This study aimed to develop a deep learning fusion network, VLMN, that integrates structured medical records with laryngoscopic images to enable early and accurate diagnosis of glottic carcinoma. The model was trained and validated on data from a tertiary hospital in China, and external validation was subsequently conducted at two other independent medical centers. Monomodal reference models were also developed for comparison. To benchmark clinical utility, a human-machine adversarial cohort was constructed to allow direct performance comparison between the model and human raters. Diagnostic performance was quantified using the area under the receiver operating characteristic curve (AUC). The model outperformed the monomodal models and performed comparably to senior otolaryngologists. VLMN holds significant potential to reduce diagnostic delays and improve patient prognosis, particularly when supporting junior otolaryngologists or in medically underserved areas.