Abstract
MOTIVATION: Species-specific orphan genes lack homologues outside of a given taxon and frequently underlie unique species traits. Orphans can result from sequence divergence beyond recognition, in which homologous proteins diverge to the point that sequence similarity search algorithms can no longer identify them as homologues. They can also evolve de novo from previously noncoding sequences, in which case homologous protein-coding genes truly do not exist. RESULTS: Here we propose that sequence-divergent orphans might be recognizable from their patterns of statistically non-significant similarity hits, which are typically discarded. To test this, we simulated diverged orphan protein sequences under varying parameters. Using reversed protein sequences as a negative control, we trained machine learning classifiers on features extracted from similarity search output. We found that this approach works, but that model performance depends on the simulation parameters, with ∼90% accuracy when the underlying simulated divergence was moderate and ∼70% when it was extreme. When applying our classifiers to a set of real orphans, we found that ∼30% of them are predicted to be divergent and that these are shorter and more disordered than the rest. Our work contributes to the effort of better understanding how genetic novelty arises. AVAILABILITY AND IMPLEMENTATION: The models and data used can be found at https://github.com/emiliostassios/Classification-of-divergent-genes-using-ML.
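The two ingredients named above, a reversed-sequence negative control and features derived from non-significant hits, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names and the three summary features (hit count, best E-value, mean bit score) are assumptions chosen for illustration, since the abstract does not state which features were extracted.

```python
from statistics import mean


def reversed_control(seq: str) -> str:
    """Return the reversed protein sequence, used here as a negative control."""
    return seq[::-1]


def hit_features(evalues, bitscores):
    """Summarise the non-significant hits of one query as a feature dict.

    The features are hypothetical examples, not those used in the paper.
    """
    if not evalues:
        return {"n_hits": 0, "best_evalue": None, "mean_bitscore": 0.0}
    return {
        "n_hits": len(evalues),
        "best_evalue": min(evalues),
        "mean_bitscore": mean(bitscores),
    }


# Toy query and made-up hit statistics, for illustration only.
control = reversed_control("MKTAY")
features = hit_features([5.0, 0.8], [22.0, 30.0])
```

Feature dictionaries of this kind, computed for simulated divergent sequences and for their reversed controls, would form the two classes on which a classifier could be trained.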