Machine learning can distinguish orphans that have resulted from sequence divergence beyond recognition



Abstract

MOTIVATION: Species-specific orphan genes lack homologues outside of a given taxon and frequently underlie unique species traits. Orphans can result from sequence divergence beyond recognition, in which homologous proteins diverge to the point where sequence similarity search algorithms can no longer identify them as homologues. They can also evolve de novo from previously noncoding sequences, in which case homologous protein-coding genes truly do not exist.

RESULTS: Here we propose that sequence-divergent orphans might be recognizable from their patterns of statistically non-significant similarity hits, which are typically discarded. To test this, we simulated diverged orphan protein sequences under varying parameters. Using reversed protein sequences as a negative control, we trained machine learning classifiers on features extracted from similarity search output. We found that this approach works, but model performance depends on the simulation parameters, with ∼90% accuracy when the underlying simulated divergence was moderate and ∼70% when it was extreme. When applying our classifiers to a set of real orphans, we found that ∼30% of them are predicted to be divergent, and that these are shorter and more disordered than the rest. Our work contributes to the effort to better understand how genetic novelty arises.

AVAILABILITY AND IMPLEMENTATION: The models and data used can be found at https://github.com/emiliostassios/Classification-of-divergent-genes-using-ML.
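
As a rough illustration of two steps named in the abstract, the sketch below builds a reversed-sequence negative control and summarises a query's similarity hits as a small feature vector. The abstract does not list the features actually used, so the feature names, the `(evalue, bitscore)` hit format, and both function names are assumptions made for this example only, not the authors' implementation.

```python
import math
from statistics import mean


def reverse_protein(seq: str) -> str:
    """Return the reversed amino-acid sequence, used here as a negative control."""
    return seq[::-1]


def hit_features(hits):
    """Summarise similarity hits for one query as a fixed-length feature dict.

    `hits` is a list of (evalue, bitscore) tuples. This format and the
    features below are hypothetical; the abstract does not specify them.
    """
    if not hits:
        # A query with no hits at all still needs a feature vector.
        return {"n_hits": 0, "best_bitscore": 0.0,
                "mean_bitscore": 0.0, "min_log10_evalue": 0.0}
    evalues = [e for e, _ in hits]
    bitscores = [b for _, b in hits]
    return {
        "n_hits": len(hits),
        "best_bitscore": max(bitscores),
        "mean_bitscore": mean(bitscores),
        # Floor the E-value to avoid log10(0) for extremely strong hits.
        "min_log10_evalue": math.log10(max(min(evalues), 1e-300)),
    }


# Toy usage: one positive example and its reversed negative control.
query = "MKTAYIAKQR"
control = reverse_protein(query)
features = hit_features([(0.5, 30.0), (5.0, 20.0)])
```

Feature dicts of this kind, computed for simulated divergent sequences (label 1) and reversed controls (label 0), could then be passed to any standard classifier; which classifier and features perform well is an empirical question that the paper itself addresses.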
