Abstract
Detecting and classifying rare diseases is one of the most significant challenges in applying Natural Language Processing techniques to the analysis of biomedical texts and the extraction of information from them. In this paper, we present a novel study on the detection and classification of rare diseases in clinical notes from a cohort of pediatric patients in the Community of Madrid, Spain. Starting from a set of collected and anonymized medical records, we propose a semi-supervised, keyphrase-based system that performs an initial detection of rare disease mentions; experts then validate and refine its output to build a consolidated dataset covering a subset of rare diseases. Based on this dataset, we carry out a series of rare disease classification experiments using both a semi-supervised technique and state-of-the-art supervised systems based on discriminative and generative models. A detailed case analysis provides insights into which systems excel in specific scenarios and why. The validated dataset contains a total of 1900 annotated texts containing mentions of rare diseases. Experiments on this dataset show that the best supervised models outperform the semi-supervised system by more than 10 percentage points (78.74% vs. 67.37% micro-averaged F-measure) and individually improve the classification of a significant number of diseases in the dataset. State-of-the-art supervised systems offer promising results on the detection and classification of rare diseases in clinical texts, even when little annotated data is available. Semi-supervised models, in turn, show interesting capabilities for dealing with the limited information and data in the field.