Abstract
Alignment-free signals in coding sequences provide a scalable route to taxonomic inference, quality control of large phylogenies, and rapid screening of genomic data. We present a model-agnostic framework that represents genomes or coding-sequence collections using codon usage (64-D) and codon-pair usage (4096-D) profiles, and we introduce Taxonomic Consistency (TC), a simple, rank-aware external index for evaluating supervised predictions or unsupervised clusterings against the hierarchical taxonomy. Across multiple taxonomic ranks (e.g., Domain and Order), we assess compact supervised models and standard clustering methods with both internal (Silhouette) and external (TC) validation. In large-scale experiments, handling class imbalance and applying principled normalization had a greater impact on performance than sequence-level preprocessing. Codon usage profiles yielded the highest TC and coherent unsupervised structure, while codon-pair features provided complementary resolution within specific clades. We release open, versioned code and reproducible workflows to facilitate adoption in ecological and evolutionary pipelines. Taken together, alignment-free codon and codon-pair representations, paired with TC for transparent external validation, offer a practical complement to tree-based methods, especially for rapid screening, sanity checks, and exploratory analyses at higher ranks.
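To make the two representations concrete, the following minimal sketch shows one way a 64-D codon usage profile and a 4096-D codon-pair profile could be computed from in-frame coding sequences. It is an illustration under stated assumptions, not the released implementation: the function names (`split_codons`, `codon_usage`, `codon_pair_usage`) are hypothetical, profiles are normalized to relative frequencies that sum to 1 (the framework may use a different normalization), codon pairs are taken as adjacent codons within a single sequence, and codons containing ambiguous bases are skipped.

```python
from itertools import product

BASES = "ACGT"
# All 64 codons in a fixed lexicographic order; the ordering is an assumption.
CODONS = ["".join(p) for p in product(BASES, repeat=3)]
CODON_INDEX = {codon: i for i, codon in enumerate(CODONS)}


def split_codons(cds):
    """Split an in-frame coding sequence into codons, dropping a trailing partial codon."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def _normalize(counts):
    """Convert raw counts to relative frequencies (all zeros if nothing was counted)."""
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def codon_usage(cds_list):
    """Return a 64-D relative codon frequency profile for a collection of sequences."""
    counts = [0] * 64
    for cds in cds_list:
        for codon in split_codons(cds):
            idx = CODON_INDEX.get(codon)  # None for codons with ambiguous bases
            if idx is not None:
                counts[idx] += 1
    return _normalize(counts)


def codon_pair_usage(cds_list):
    """Return a 4096-D relative frequency profile of adjacent codon pairs."""
    counts = [0] * (64 * 64)
    for cds in cds_list:
        codons = split_codons(cds)
        for first, second in zip(codons, codons[1:]):
            i = CODON_INDEX.get(first)
            j = CODON_INDEX.get(second)
            if i is not None and j is not None:
                counts[64 * i + j] += 1  # pairs never span two sequences
    return _normalize(counts)
```

For example, `codon_usage(["ATGAAATAA"])` assigns a frequency of 1/3 to each of `ATG`, `AAA`, and `TAA`, and `codon_pair_usage` on the same input splits its mass equally between the pairs (`ATG`, `AAA`) and (`AAA`, `TAA`). Pooling counts across all sequences before normalizing weights long genes more heavily; per-sequence averaging would be an equally plausible choice.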