Abstract
Understanding the genetic code of cis-regulatory elements (CREs) is essential for engineering gene expression and modulating agronomic traits in crops. In plants, CREs underlying rapid evolution of gene expression often overlap with structural variation in promoters, making them undetectable using single-reference genomes. Here, we develop K-PROB (K-mer-based in silico PROmoter Bashing), a computational tool that learns from intraspecies promoter sequence and gene expression variation in pan-genomes and pan-transcriptomes to identify CREs controlling gene expression. K-PROB deploys a k-mer-based Bayesian variable selection framework to prioritize the identification of causal variables. We demonstrate the effectiveness of our approach in maize and soybean, two staple crop species. Applying K-PROB to genes with the most highly variable promoter sequences and the most diverse patterns of expression, such as nucleotide-binding leucine-rich repeat receptors, we identified k-mers that are enriched for bona fide transcription factor binding sequences and that overlap with open chromatin regions and DAP-seq binding sites. Notably, multiple significant k-mers are located within presence/absence structural variants, highlighting structural variation in promoters as a key driver of transcriptional diversity of highly variable genes. We further validated the regulatory effects of identified k-mers on gene expression using luciferase reporter assays. Our results showcase a high-throughput, pan-genomic approach for probing natural intraspecies cis-regulatory diversity, discovering new causative cis-elements, and facilitating future expression engineering across plant species.