Phylogenetic analysis of protein sequences provides a powerful means of identifying novel protein functions and subfamilies, and of identifying and resolving annotation errors. However, automating functional clustering based on phylogenetic trees has been challenging, and most of it is still done manually. Clustering phylogenetic trees usually requires the delineation of tree-based thresholds (e.g., distances), which makes the resulting clusters depend on ad hoc choices. We propose a new phylogenetic clustering approach, AutoPhy, that identifies clusters without using ad hoc distances or other pre-defined values. Our workflow combines uniform manifold approximation and projection (UMAP) with Gaussian mixture models, used as a k-means-like procedure, to automatically group sequences into clusters. We then apply a "second pass" clade identification algorithm to resolve non-monophyletic groups. We tested our approach on several well-curated protein families (outer membrane porins, acyltransferases, and nuclear receptors) and showed that our automated methods recapitulated known subfamilies. We also applied our methods to a broad range of protein families from multiple databases, including Pfam, PANTHER, and UniProt, and to alignments of RNA viral genomes. Our results showed that AutoPhy rapidly generated monophyletic clusters (subfamilies) within phylogenetic trees evolving at very different rates, both within and among phylogenies. The phylogenetic clusters generated by AutoPhy resolved misannotations and identified new protein functional groups and novel viral strains.
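The "second pass" step can be illustrated with a small sketch. The abstract does not give the algorithm, so this is not AutoPhy's implementation; it only shows one simple way to turn a non-monophyletic cluster into monophyletic groups, by splitting it into the largest clades whose tips all belong to that cluster. The nested-tuple tree, the sequence names, the cluster labels, and the function names `leaves`, `maximal_clades`, and `resolve_clusters` are all hypothetical, and the initial labels are assumed to come from an earlier embedding and mixture-model step.

```python
# Toy rooted tree as nested tuples; strings are leaf (sequence) names.
# Tree, names, and labels are made up for illustration only.
tree = ((("A", "B"), "C"), (("D", "E"), "F"))

# Assumed output of a first-pass clustering step (cluster 0 is not monophyletic).
labels = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 0}


def leaves(node):
    """Return the set of leaf names below a node."""
    if isinstance(node, str):
        return {node}
    found = set()
    for child in node:
        found |= leaves(child)
    return found


def maximal_clades(node, members):
    """Split `members` into the largest clades made up only of members."""
    tips = leaves(node)
    if tips <= members:
        return [tips]  # the whole subtree belongs to the cluster
    if isinstance(node, str) or not (tips & members):
        return []  # nothing from this cluster below this node
    clades = []
    for child in node:
        clades.extend(maximal_clades(child, members))
    return clades


def resolve_clusters(tree, labels):
    """Map each initial cluster label to a list of monophyletic groups."""
    clusters = {}
    for name, label in labels.items():
        clusters.setdefault(label, set()).add(name)
    return {label: maximal_clades(tree, members)
            for label, members in clusters.items()}


resolved = resolve_clusters(tree, labels)
for label, groups in resolved.items():
    print(label, [sorted(group) for group in groups])
```

In this toy case cluster 0 is split into the clade {A, B, C} and the single tip F, because F sits inside the lineage that also contains D and E, while cluster 1 is already a clade and is returned unchanged.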
AutoPhy: Automated phylogenetic identification of novel protein subfamilies.
Authors: Ortiz-Velez Adrian N, Sukumaran Jeet, Rouzbehani Ryin, Kelley Scott T
| Journal: | PLoS One | Impact factor: | 2.600 |
| Year: | 2024 | Citation: | 2024 Jan 11; 19(1):e0291801 |
| DOI: | 10.1371/journal.pone.0291801 | | |
