Abstract
INTRODUCTION: This study aimed to develop and evaluate an artificial intelligence (AI) pipeline combining object detection and classification models to assist in the early identification and differentiation of oral diseases.

METHODS: This retrospective cross-sectional study used clinical images of oral potentially malignant disorders (OPMD) and oral squamous cell carcinoma (OSCC), comprising a baseline dataset of 773 images from Faculdade de Odontologia de Piracicaba, Universidade Estadual de Campinas (FOP-UNICAMP) and an external validation dataset of 132 images from the Federal University of Paraíba (UFPB). All images were obtained prior to biopsy and had corresponding histopathological reports. For object detection, ten YOLOv11 models were developed with varying data augmentation strategies, each trained for 200 epochs from pretrained COCO weights. For classification, three MobileNetV2 models were trained on images cropped according to the experts' reference bounding-box annotations, each using a different combination of learning rate and data augmentation. After selecting the best detector-classifier combination, we integrated them into a two-step pipeline in which the images cropped by the detector were forwarded to the classifier.

RESULTS: The best YOLOv11 configuration achieved a mAP50 of 0.820, precision of 0.897, recall of 0.744, and F1-score of 0.813. For classification, the best MobileNetV2 configuration achieved an accuracy of 0.846, precision of 0.871, recall of 0.846, F1-score of 0.844, and AUC-ROC of 0.852. On external validation, the same model reached an accuracy of 0.850, precision of 0.866, recall of 0.850, F1-score of 0.851, and AUC-ROC of 0.935. The two-step approach, applied to the test set of the baseline dataset, achieved an accuracy of 0.784, precision of 0.793, recall of 0.784, F1-score of 0.784, and AUC-ROC of 0.811. On the external validation dataset, it yielded an accuracy of 0.863, precision of 0.879, recall of 0.863, F1-score of 0.866, and AUC-ROC of 0.934. Visual inspection of YOLO's inference outputs confirmed consistent lesion localization across diverse oral cavity images, although 17.4% of lesions were missed. The t-SNE visualization showed partial separation between OPMD and OSCC feature embeddings, indicating that the model captured discriminative patterns despite some class overlap.

CONCLUSION: This proof-of-concept study demonstrates the feasibility of a two-step AI pipeline combining object detection and classification to support the early diagnosis of oral diseases. However, caution is warranted when interpreting the results of two-step approaches, as images missed by YOLO during detection are excluded from the classification stage, which may affect the reported performance metrics.
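The two-step logic described in the abstract, including the exclusion of undetected images from the classification stage, can be sketched as follows. This is an illustrative Python sketch with toy stand-ins, not the study's implementation: the `detect` and `classify` stubs substitute for the trained YOLOv11 detector and MobileNetV2 classifier, and all function names and thresholds here are hypothetical.

```python
# Illustrative sketch of the two-step detect-then-classify pipeline.
# The `detect` and `classify` stubs stand in for the trained YOLOv11
# detector and MobileNetV2 classifier; all names here are hypothetical.

def crop(image, box):
    """Crop a 2-D pixel grid (list of rows) to an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]

def run_pipeline(images, detect, classify):
    """Step 1: detect a lesion box; step 2: classify the cropped region.

    Images with no detection are skipped entirely, which is why two-step
    metrics end up being computed on a subset of the original test set.
    """
    results, missed = [], []
    for idx, image in enumerate(images):
        box = detect(image)
        if box is None:
            missed.append(idx)  # excluded from the classification stage
            continue
        results.append((idx, classify(crop(image, box))))
    return results, missed

# Toy stand-ins: detect any non-zero pixel; classify by total intensity.
def detect(image):
    return (0, 0, 2, 2) if any(any(row) for row in image) else None

def classify(patch):
    return "OSCC" if sum(map(sum, patch)) >= 3 else "OPMD"

images = [
    [[0, 0], [0, 1]],  # faint lesion  -> detected, classified OPMD
    [[0, 0], [0, 0]],  # no detection  -> excluded from classification
    [[1, 1], [1, 1]],  # strong lesion -> detected, classified OSCC
]
results, missed = run_pipeline(images, detect, classify)
print(results)  # [(0, 'OPMD'), (2, 'OSCC')]
print(missed)   # [1]
```

The `missed` list makes the conclusion's caveat concrete: any accuracy computed over `results` alone ignores the undetected images, so two-step metrics are not directly comparable to metrics from a classifier evaluated on the full test set.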