An NLP-based method to mine gene and function relationships from published articles

Abstract

Understanding the intricacies of gene function within biological systems is paramount for scientific advancement and medical progress. However, the rapidly evolving research landscape and the complexity of biological processes make this task challenging. We introduce PATHAK, a natural language processing (NLP)-based method that mines relationships between genes and their functions from published scientific articles. PATHAK uses a pre-trained Transformer language model to generate sentence embeddings from a large dataset of scientific documents, which enables the identification of meaningful associations between genes and their potential functional annotations. The approach is adaptable and applicable across diverse scientific domains. Applying PATHAK to over 17,000 research articles focused on Arabidopsis thaliana, we assigned approximately 1493 GO terms to 10,976 genes by analyzing article sentences, comparing their embeddings to GO term embeddings, and mapping potential matches. The model demonstrates moderate-to-high predictive accuracy, capturing ~57% overlap of GO terms (6258 out of 10,976) between predicted and known annotations on TAIR, including 1271 and 161 exact matches and 4826 partially related terms. This method promises to significantly advance our understanding of gene function and could accelerate discoveries in plant development, growth, and stress responses, as well as in other systems.
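
The matching step described in the abstract (embed each article sentence, compare it with GO term embeddings, and map the best match to the genes mentioned) can be illustrated with a minimal sketch. This is not the PATHAK implementation. The `embed` function is a toy bag-of-words stand-in for the pre-trained Transformer encoder, and the function names, the gene-mention rule, the 0.5 threshold, and the example sentences are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector. ASSUMPTION: stands in for a Transformer
    sentence encoder, which the abstract names but does not specify."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_go_terms(sentences, go_terms, genes, threshold=0.5):
    """Assign to each mentioned gene the best-matching GO term per sentence.

    The threshold and the 'gene symbol appears as a token' rule are
    hypothetical choices, not taken from the paper.
    """
    go_vecs = {go_id: embed(label) for go_id, label in go_terms.items()}
    hits = {}
    for sentence in sentences:
        vec = embed(sentence)
        mentioned = [g for g in genes if g.lower() in vec]
        if not mentioned:
            continue  # no known gene symbol in this sentence
        best_id, best_score = max(
            ((go_id, cosine(vec, go_vec)) for go_id, go_vec in go_vecs.items()),
            key=lambda pair: pair[1],
        )
        if best_score >= threshold:
            for gene in mentioned:
                hits.setdefault(gene, set()).add(best_id)
    return hits


# Illustrative inputs only: the sentences are invented, not from the corpus.
go_terms = {
    "GO:0009409": "response to cold",
    "GO:0009414": "response to water deprivation",
    "GO:0009908": "flower development",
}
sentences = [
    "CBF1 expression increases in response to cold stress.",
    "FT promotes flower development under long days.",
    "CBF1 was cloned into a vector.",
    "Plants were grown in soil.",
]
hits = match_go_terms(sentences, go_terms, genes=["CBF1", "FT"])
print(hits)
```

With real sentence embeddings, only `embed` and `cosine` would need to change (dense vectors instead of token counts). The thresholding and gene-to-term mapping would stay the same, and the similarity threshold would need tuning against known annotations such as those on TAIR.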
