An Alignment Free Framework for Taxonomic Inference From Codon and Codon-Pair Usage

Abstract

Alignment-free signals in coding sequences provide a scalable route to taxonomic inference, quality control of large phylogenies, and rapid screening of genomic data. We present a model-agnostic framework that represents genomes or coding-sequence collections using codon usage (64-D) and codon-pair usage (4096-D) profiles. We also introduce Taxonomic Consistency (TC), a simple, rank-aware external index for evaluating supervised predictions or unsupervised clusterings against the hierarchical taxonomy. Across multiple taxonomic ranks (e.g., Domain and Order), compact supervised models and standard clustering methods are assessed with both internal (Silhouette) and external (TC) validation. In large-scale experiments, handling class imbalance and applying principled normalization had a greater impact on performance than sequence-level preprocessing. Codon usage profiles yielded the highest TC and coherent unsupervised structure, while codon-pair features provided complementary resolution within specific clades. We release open, versioned code and reproducible workflows to facilitate adoption in ecological and evolutionary pipelines. Taken together, alignment-free codon and codon-pair representations, paired with TC for transparent external validation, offer a practical complement to tree-based methods, especially for rapid screening, sanity checks, and exploratory analyses at higher ranks.
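
To make the two representations concrete, the following is a minimal sketch of how a 64-D codon usage profile and a 4096-D codon-pair profile could be computed from a collection of in-frame coding sequences. It is an illustration under stated assumptions, not the authors' released code: the function names (`split_codons`, `codon_usage`, `codon_pair_usage`) are hypothetical, the normalization is plain relative frequency (the abstract does not specify which normalization was used), and codons containing ambiguous bases are simply skipped.

```python
from itertools import product

BASES = "ACGT"
# All 64 codons in a fixed lexicographic order (AAA, AAC, ..., TTT).
CODONS = ["".join(p) for p in product(BASES, repeat=3)]
CODON_INDEX = {codon: i for i, codon in enumerate(CODONS)}


def split_codons(cds):
    """Split an in-frame coding sequence into codons, dropping a trailing partial codon."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def _normalize(counts):
    """Convert raw counts to relative frequencies (assumed normalization)."""
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def codon_usage(cds_list):
    """Return a 64-D relative-frequency profile over all sequences in cds_list."""
    counts = [0.0] * 64
    for cds in cds_list:
        for codon in split_codons(cds):
            idx = CODON_INDEX.get(codon)
            if idx is not None:  # skip codons with ambiguous bases such as N
                counts[idx] += 1
    return _normalize(counts)


def codon_pair_usage(cds_list):
    """Return a 4096-D relative-frequency profile of adjacent codon pairs."""
    counts = [0.0] * (64 * 64)
    for cds in cds_list:
        codons = split_codons(cds)
        for first, second in zip(codons, codons[1:]):
            i, j = CODON_INDEX.get(first), CODON_INDEX.get(second)
            if i is not None and j is not None:
                counts[i * 64 + j] += 1  # pairs never span two sequences
    return _normalize(counts)


if __name__ == "__main__":
    example = ["ATGGCCATGTAA"]  # codons: ATG, GCC, ATG, TAA
    usage = codon_usage(example)
    pairs = codon_pair_usage(example)
    print(len(usage), len(pairs))
    print(usage[CODON_INDEX["ATG"]])
```

Pairs are counted only within a sequence, so concatenating genes does not create spurious cross-gene pairs. The resulting vectors can be passed to any classifier or clustering method, which is consistent with the model-agnostic framing of the abstract.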
