Abstract
OBJECTIVES: This study aimed to develop robust machine learning (ML)-based and deep learning (DL)-based models capable of detecting mpox cases for surveillance efforts using clinical notes.
METHODS: As part of a learning health system initiative, we conducted a retrospective study of clinical encounters at the Columbia University Irving Medical Center in New York City. We included patients with mpox diagnoses confirmed by PCR testing between 15 May 2022 and 15 October 2022 and three matched controls for each case based on patient age, sex, race, ethnicity and visit month. We trained three mpox surveillance models using: (1) logistic regression with L1 regularisation (least absolute shrinkage and selection operator (LASSO)), (2) ClinicalBERT and (3) ClinicalLongformer. We evaluated model performance using precision, recall, F1 score, area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC) and recall at 80% precision (RP80).
RESULTS: The study included 228 PCR-confirmed mpox cases and 698 controls. LASSO regression outperformed the DL models with a precision, recall and F1 score of 0.93, AUROC of 0.97, AUPRC of 0.93 and RP80 of 0.89. ClinicalBERT achieved a precision of 0.88, recall of 0.89, F1 score of 0.88 and AUROC of 0.93. ClinicalLongformer achieved a precision of 0.87, recall of 0.88, F1 score of 0.87 and AUROC of 0.92. Phrases related to symptoms (eg, lesions and pain) were among the most predictive features in LASSO regression.
CONCLUSIONS: ML and DL models based on clinical notes show promise for identifying mpox cases. In this study, LASSO regression outperformed DL models and excelled in minimising false positives. These findings highlight the potential for ML and DL methods to support case surveillance for mpox and other infectious diseases. These methods may also prove helpful for flagging missed or delayed diagnoses as part of continuous quality improvement.
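Of the metrics listed above, recall at 80% precision (RP80) is the least standard. The abstract does not define it, so the sketch below assumes a common reading: the highest recall reached at any score threshold whose precision is at least 0.80. The function name `recall_at_precision`, its parameters and the toy labels and scores are hypothetical and purely illustrative; they are not the study's code or data.

```python
from typing import Sequence


def recall_at_precision(
    y_true: Sequence[int],
    y_score: Sequence[float],
    min_precision: float = 0.80,
) -> float:
    """Highest recall over all score thresholds with precision >= min_precision.

    Assumed definition of RP80 (min_precision=0.80); returns 0.0 if no
    threshold reaches the required precision.
    """
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("y_true contains no positive labels")

    # Rank encounters from highest to lowest predicted score.
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)

    best_recall = 0.0
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Treat every item tied at this score as predicted positive together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        precision = tp / (tp + fp)
        recall = tp / total_pos
        if precision >= min_precision:
            best_recall = max(best_recall, recall)
    return best_recall


# Toy example (invented values): 4 cases and 4 controls.
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
toy_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.10]
toy_rp80 = recall_at_precision(toy_labels, toy_scores)
```

Under this reading, RP80 summarises how many true cases a model can recover while keeping false alerts to at most one in five flagged encounters, which is why it suits a surveillance setting where false positives carry a review cost.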