Abstract
BACKGROUND: Around 30 million people in Europe are affected by a rare (or orphan) disease, defined as a condition occurring in fewer than 1 in 2,000 individuals. The primary challenge is to identify, automatically and efficiently, the scientific articles and guidelines that address a particular rare disease. We present a novel methodology for annotating and indexing scientific text with taxonomical concepts describing rare diseases from the OrphaNet taxonomy. The task is complicated by several technical challenges, including the lack of sufficiently large, human-annotated datasets for supervised training and the polysemy, synonymy, and surface-form variation of rare disease names, which can hinder any annotation engine. RESULTS: We introduce a framework that operationalizes OrphaNet for large-scale literature annotation by integrating the TERMite engine with curated synonym expansion, label normalization (including deprecated or renamed concepts), and fuzzy matching. On benchmark datasets, the approach achieves a precision of 92%, a recall of 75%, and an F1 score of 83%, outperforming a string-matching baseline. Applying the pipeline to Scopus produces disease-specific corpora suitable for bibliometric and scientometric analyses (e.g., institution, country, and subject-area profiles). These outputs power the Rare Diseases Monitor dashboard for exploring national and global research activity. CONCLUSION: To our knowledge, this is the first systematic, scalable semantic framework for annotating and indexing rare disease literature. By operationalizing OrphaNet in an automated, reproducible pipeline and addressing data scarcity and lexical variability, the work advances biomedical semantics for rare diseases and enables disease-centric monitoring, evaluation, and discovery across the research landscape.
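As a rough illustration of how synonym expansion, label normalization, and fuzzy matching can combine in a dictionary-based annotator, the sketch below uses only the Python standard library. It is not the TERMite engine and not the pipeline described above: the lexicon, the `EX:` concept identifiers, the `normalize` and `annotate` helpers, and the 0.9 similarity cutoff are all hypothetical choices made for this example. The `f1` helper simply computes the harmonic mean of precision and recall, which is how the reported 83% follows from 92% and 75%.

```python
import difflib
import re

# Hypothetical mini-lexicon: concept ID -> preferred label plus synonyms.
# The IDs are placeholders for illustration, not real OrphaNet codes.
LEXICON = {
    "EX:001": ["cystic fibrosis", "mucoviscidosis"],
    "EX:002": ["Huntington disease", "Huntington's disease", "Huntington chorea"],
}


def normalize(text):
    """Lowercase, drop possessive 's, and replace punctuation with single spaces."""
    text = text.lower()
    text = re.sub(r"'s\b", "", text)
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()


# Synonym expansion: every normalized surface form points to its concept ID.
INDEX = {normalize(form): cid for cid, forms in LEXICON.items() for form in forms}
MAX_WORDS = max(len(key.split()) for key in INDEX)


def annotate(text, cutoff=0.9):
    """Return the set of concept IDs whose surface forms occur in `text`.

    Word n-grams are matched exactly first; if that fails, difflib is used
    as a simple fuzzy fallback (assumed cutoff, not a tuned value).
    """
    tokens = normalize(text).split()
    found = set()
    for n in range(1, MAX_WORDS + 1):
        for i in range(len(tokens) - n + 1):
            window = " ".join(tokens[i:i + n])
            if window in INDEX:
                found.add(INDEX[window])
                continue
            close = difflib.get_close_matches(window, list(INDEX), n=1, cutoff=cutoff)
            if close:
                found.add(INDEX[close[0]])
    return found


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    sentence = "Patients with Huntington's disease and cystic fibrosys were enrolled."
    print(sorted(annotate(sentence)))
    print(round(f1(0.92, 0.75), 2))
```

In this toy setting the possessive form is resolved by normalization, while the misspelling "fibrosys" is recovered only through the fuzzy fallback. A production system would also need disambiguation of polysemous names and handling of deprecated labels, which this sketch does not attempt.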