Abstract
BACKGROUND: Early diagnosis of oral potentially malignant disorders (OPMDs) and oral cancer is crucial for reducing oral cancer incidence and mortality. Deep learning has advanced oral image recognition, but public oral mucosal disease datasets are often small and cover too few disease types. This study addressed those limitations and, through comparative analysis, proposed a dual-stage multi-class classification approach.
METHODS: This study established and publicly released a high-quality oral mucosal disease dataset comprising 1,348 images in five categories: normal oral mucosa, oral leukoplakia, oral lichen planus, oral submucous fibrosis, and oral cancer. After image preprocessing, ten pre-trained models commonly used in oral image recognition (DenseNet-169, EfficientNet-B0, HRNet-W18-C, Inception-V4, MixNet-S, MobileNetV3-Large, ResNet-101, Swin Transformer, ViT-B, and YOLOv11l) were trained and evaluated under two classification pathways: single-stage and dual-stage. Model performance was evaluated using accuracy, precision, recall, F1-score, area under the curve (AUC), and confusion matrices.
RESULTS: The dual-stage classification model based on Swin Transformer and DenseNet-169 achieved an accuracy of 0.9029, precision of 0.9082, recall of 0.8995, F1-score of 0.9032, and AUC of 0.9735. The best-performing single-stage model, EfficientNet-B0, achieved an accuracy of 0.8710, precision of 0.8715, recall of 0.8719, F1-score of 0.8698, and AUC of 0.9766. The dual-stage model therefore outperformed the best single-stage model in accuracy, precision, recall, and F1-score, while its AUC was marginally lower (0.9735 vs. 0.9766).
CONCLUSIONS: This study established a publicly available, high-quality dataset of 1,348 oral mucosal disease images. It also proposed a novel dual-stage classification model integrating Swin Transformer and DenseNet-169, which outperformed conventional single-stage classification models in accuracy, precision, recall, and F1-score.
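As an illustration of how the reported accuracy, precision, recall, and F1-score relate to a confusion matrix, the following is a minimal sketch in plain Python. It assumes macro-averaging over classes, which the abstract does not state, and the function name `metrics_from_confusion` and the 3-class toy matrix are hypothetical; none of the numbers below come from the study.

```python
def metrics_from_confusion(cm):
    """Compute accuracy and macro-averaged precision, recall, and F1.

    cm[i][j] is the number of samples whose true class is i and whose
    predicted class is j. Macro-averaging is an assumption of this sketch.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[k][k] for k in range(n)) / total

    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)

    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy 3-class confusion matrix (illustrative only, not study data).
toy_cm = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
toy_metrics = metrics_from_confusion(toy_cm)
print(toy_metrics)
```

For the five-category task described above, `cm` would be a 5x5 matrix built from the test-set predictions; AUC is not derivable from the confusion matrix alone, since it requires the per-class predicted probabilities.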