Abstract
We applied supervised and unsupervised machine learning (ML) analyses to a cohort of 140 patients referred to the Hematology Unit of the G. Gaslini Institute from 1989 to 2023 for persistent cytopenia and/or features suggestive of telomere biology disorders (TBDs). Patients were labeled as "TBD" (n = 20, established molecular diagnosis of TBD), "other diagnosis" (OD, n = 27, established molecular diagnosis of another congenital disease, including marrow failure syndromes), or "undefined diagnosis" (UD, n = 93, no established molecular diagnosis). A random forest model was trained on the 47 patients with an established molecular diagnosis (20 TBD and 27 OD) and then applied to the UD group, where it predicted potential TBD in 16/93 patients (17.2%) and potential OD in 77/93 patients (82.8%). The unsupervised approach, applied to the whole cohort (n = 140), identified 4 distinct clusters that were significantly associated (P = 0.000001) with the molecular diagnoses of the 47 genetically characterized patients, with TBD patients prevailing in Clusters 1 and 2 and OD patients in Clusters 3 and 4. Telomere length (TL) and mucocutaneous abnormalities were the most relevant features for discriminating between the TBD and OD groups in both the supervised and the unsupervised analyses, and both prevailed in Clusters 1 and 2. Notably, the two analyses yielded concordant results in the UD group: all 16 patients without a molecular diagnosis who were predicted to have TBD by the supervised approach were placed in the "TBD clusters" (Clusters 1 and 2) of the unsupervised analysis. This model might correctly reallocate a substantial proportion of undefined or previously misclassified cases, thus potentially improving the diagnostic work-up of rare and challenging diseases such as TBD.
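The two summary quantities reported above, the share of UD patients assigned to each predicted class and the agreement between the supervised predictions and the unsupervised clusters, can be illustrated with a minimal Python sketch. It is not the authors' code: the helper names `percentage` and `concordance` are hypothetical, and the six-patient example is made-up toy data used only to show the calculation. Only the counts 16, 77, and 93 and the designation of Clusters 1 and 2 as "TBD clusters" come from the abstract.

```python
# Clusters in which TBD patients prevailed, as stated in the abstract.
TBD_CLUSTERS = frozenset({1, 2})


def percentage(count, total):
    """Return count/total as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * count / total, 1)


def concordance(predicted_labels, cluster_ids, tbd_clusters=TBD_CLUSTERS):
    """Fraction of predicted-TBD patients whose cluster is a TBD cluster.

    Returns None when no patient is predicted as TBD.
    """
    if len(predicted_labels) != len(cluster_ids):
        raise ValueError("inputs must have the same length")
    tbd_cluster_ids = [
        cluster
        for label, cluster in zip(predicted_labels, cluster_ids)
        if label == "TBD"
    ]
    if not tbd_cluster_ids:
        return None
    hits = sum(1 for cluster in tbd_cluster_ids if cluster in tbd_clusters)
    return hits / len(tbd_cluster_ids)


# Shares of the UD group (n = 93) by predicted class, from the reported counts.
share_tbd = percentage(16, 93)
share_od = percentage(77, 93)

# Toy, made-up assignments for six hypothetical UD patients (not study data).
toy_predictions = ["TBD", "OD", "TBD", "OD", "OD", "TBD"]
toy_clusters = [1, 3, 2, 4, 3, 1]
toy_agreement = concordance(toy_predictions, toy_clusters)

print(share_tbd, share_od, toy_agreement)
```

A concordance of 1.0 corresponds to the situation described in the abstract, where every patient predicted as TBD by the supervised model falls in Cluster 1 or 2.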