Abstract
16S rRNA amplicon sequencing is widely used for microbiome profiling, but most function prediction methods rely on reference databases of characterized organisms, which limits their accuracy in underrepresented environments. We found that 16S rRNA k-mer composition carries substantial functional signal: (i) whole-genome k-mer profiles predict genome-encoded functions, and (ii) 16S rRNA k-mer profiles reflect the composition of their source genomes. Building on these relationships, we developed embeRNA, a neural network framework that predicts functions directly from 16S rRNA k-mer embeddings without requiring taxonomy assignment or phylogenetic placement. embeRNA outputs per-function probability scores, enabling users to tune decision thresholds to balance precision and recall or to account for community novelty. In a stringent "novel microbes" benchmark, in which all test sequences shared <97% identity with training data, embeRNA outperformed reference-based methods, particularly for hard-to-label functions. Applied to soil metagenomes with paired 16S and whole metagenome shotgun sequencing (WMS) data, embeRNA recovered most WMS-inferred functions and produced abundance profiles strongly correlated with WMS results, outperforming a reference-based approach. Our findings demonstrate that 16S rRNA directly captures functional potential and that 16S amplicon sequencing data can complement WMS-based inference to broaden functional characterization of microbiomes, especially in understudied environments.
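
To make the two ideas above concrete, the following is a minimal sketch of (a) a normalized k-mer frequency profile for a single sequence and (b) thresholding of per-function probability scores. It is not the embeRNA implementation: the abstract does not specify the k-mer size, the embedding, or the network, so `kmer_profile`, `call_functions`, the default `k=3`, and the function names and scores in the usage lines are all hypothetical and chosen only for illustration.

```python
from itertools import product


def kmer_profile(seq, k=3):
    """Return normalized k-mer frequencies over ACGT, in lexicographic k-mer order.

    Assumption: plain relative frequencies; the actual embedding may differ.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing ambiguous bases (e.g. N)
            counts[window] += 1
    total = sum(counts.values())
    return [counts[km] / total if total else 0.0 for km in kmers]


def call_functions(probs, threshold=0.5):
    """Return the names of functions whose probability score meets the threshold."""
    return sorted(name for name, p in probs.items() if p >= threshold)


# Illustrative usage with made-up scores (not results from the paper).
profile = kmer_profile("ACGTACGT", k=2)
scores = {"func_A": 0.9, "func_B": 0.4, "func_C": 0.6}
lenient = call_functions(scores, threshold=0.5)  # favors recall
strict = call_functions(scores, threshold=0.8)   # favors precision
```

Raising the threshold trades recall for precision, which is the kind of tuning the per-function probability output is meant to allow.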