Abstract
Proteomes are typically analyzed at the level of individual proteins or protein families. In this study, we introduce a bottom-up approach that treats proteomes as holistic entities by examining the properties of k-mers within entire proteomes and protein groups. We performed a comprehensive analysis of short amino acid k-mer (k = 1, 2, 3) distributions across all proteins in a given proteome. Using 86 bacterial proteomes representing 18 clades, we evaluated whether k-mer frequencies uniquely characterize the analyzed organisms. Remarkably, in a post hoc analysis, we found that the k-mer frequency vector unambiguously coevolves with the entire proteome, a pattern not observed even within specific protein groups, such as conserved ribosomal proteins or more variable nucleotide-binding proteins. This finding holds regardless of the k-mer calculation parameters or the distance metrics employed. Our results show that even a simple analysis based on tripeptide frequencies can precisely position proteomes within the k-mer space. Moreover, relationships derived from k-mer comparisons correlate strongly with evolutionary relationships derived from phylogenetic trees, reaching up to a 99% match with the reference classification of the proteomes within major bacterial clades. These findings establish k-mer-based proteomic analysis as an additional robust and powerful feature for characterizing evolutionary relationships, opening new pathways in phylogenetics and evolutionary genomics.
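
To make the notion of a proteome-wide k-mer frequency vector concrete, the following is a minimal Python sketch, not the pipeline used in the study. It assumes overlapping windows that never span protein boundaries, the 20 standard amino acids as the alphabet (k-mers containing any other symbol are skipped), and Euclidean distance as one possible metric. The function names `kmer_frequencies` and `euclidean` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import product
import math

# Assumption: only the 20 standard amino acids define the k-mer space.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_frequencies(proteins, k):
    """Return relative k-mer frequencies pooled over all proteins.

    The result is a list of length 20**k, ordered lexicographically
    over AMINO_ACIDS, so vectors from different proteomes are comparable.
    """
    counts = Counter()
    for seq in proteins:
        # Slide an overlapping window within each protein separately,
        # so no k-mer spans the boundary between two proteins.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Assumption: skip k-mers with ambiguous or non-standard residues.
            if all(residue in AMINO_ACIDS for residue in kmer):
                counts[kmer] += 1
    total = sum(counts.values())
    all_kmers = ("".join(p) for p in product(AMINO_ACIDS, repeat=k))
    return [counts[kmer] / total if total else 0.0 for kmer in all_kmers]


def euclidean(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

For k = 3 each proteome maps to a point in a space of 20^3 = 8000 dimensions, and pairwise distances between such points could then be passed to any clustering or tree-building routine. Other distance metrics can be substituted for `euclidean` without changing the vector construction.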