Abstract
BACKGROUND: Identifying pathogenic variants is a crucial step in diagnosing monogenic genetic disorders. The widespread adoption of Next-Generation Sequencing (NGS) has significantly improved diagnostic efficiency. However, as clinical demand for diagnostic accuracy grows, quickly and accurately pinpointing the pathogenic variant among numerous candidates remains a significant challenge, and the complexity of data analysis and interpretation continues to limit both the efficiency and the accuracy of diagnosis.
METHODS: We developed geneEX, a phenotype-driven algorithm. It uses large language model technology to extract phenotypes from clinical information and a semantic vector representation model to map them automatically to Human Phenotype Ontology (HPO) terms, from which HPO-associated genes are identified. It also supports semantic matching between patients' free-text phenotypic descriptions and disease phenotypes, which further improves the identification of pathogenic genes. The algorithm ranks candidate causative variants, enabling rapid and precise identification of potential pathogenic variants in rare genetic disorders.
RESULTS: geneEX performed well in ranking pathogenic variants on both virtual and clinical datasets. Supplementary matching of free-text phenotypes significantly improved the precision of candidate variant prioritization.
CONCLUSION: Through its independently developed phenotype extraction and standardization methods, geneEX acquires HPO terms automatically, enabling fully automated identification from clinical samples to pathogenic variants. Matching free-text phenotypic descriptions against disease phenotypes further improves the accuracy of pathogenic gene identification. This approach significantly improves the precision and efficiency of identifying pathogenic variants in rare genetic disorders and provides robust support for the diagnosis of monogenic diseases.
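To make the phenotype-driven idea concrete, the following is a minimal illustrative sketch, not the geneEX implementation. It assumes a toy bag-of-words vector in place of the semantic vector representation model, a three-term HPO subset, and hypothetical placeholder genes (`GENE_A`, `GENE_B`, `GENE_C`). The function names `embed`, `match_hpo`, and `rank_genes` are likewise hypothetical. Free-text phenotype phrases are matched to the closest HPO term by cosine similarity, and genes are ranked by the summed similarity of their matched terms.

```python
import math
import re
from collections import Counter

# Toy HPO subset (IDs and labels follow the public ontology).
HPO_TERMS = {
    "HP:0001250": "seizure",
    "HP:0001263": "global developmental delay",
    "HP:0000252": "microcephaly",
}

# ASSUMPTION: placeholder gene associations, for illustration only.
HPO_TO_GENES = {
    "HP:0001250": ["GENE_A", "GENE_B"],
    "HP:0001263": ["GENE_A"],
    "HP:0000252": ["GENE_C"],
}


def embed(text):
    """Stand-in for a semantic vector model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_hpo(phrase, threshold=0.5):
    """Return (hpo_id, score) for the best-matching term, or None if below threshold."""
    query = embed(phrase)
    best_id, best_score = None, 0.0
    for hpo_id, label in HPO_TERMS.items():
        score = cosine(query, embed(label))
        if score > best_score:
            best_id, best_score = hpo_id, score
    if best_id is None or best_score < threshold:
        return None
    return best_id, best_score


def rank_genes(phrases):
    """Rank genes by the summed similarity of the HPO terms matched to the phrases."""
    totals = Counter()
    for phrase in phrases:
        match = match_hpo(phrase)
        if match is None:
            continue  # phrase could not be standardized to an HPO term
        hpo_id, score = match
        for gene in HPO_TO_GENES.get(hpo_id, []):
            totals[gene] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for gene, score in rank_genes(["developmental delay", "seizure"]):
        print(gene, round(score, 3))
```

In a real pipeline the bag-of-words vector would be replaced by a learned sentence embedding, and the gene scores would be combined with variant-level evidence before ranking. Both steps are outside the scope of this sketch.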