Abstract
Traditional methods for identifying associations between genomic features and traits, or between pairs of genomic traits, struggle when applied to bacterial genomes. Several microbial GWAS (mGWAS) methods have been developed to account for the strong, evolutionarily induced associations that genome-wide linkage creates in bacteria. However, these methods have high false discovery rates or lack statistical power, perform poorly on negative interactions, and face computational limits at the scale required for pangenome-wide study of gene-gene interactions. Here, we present SimPhyNI, a computationally optimized framework for efficient and rigorous mGWAS. SimPhyNI builds null co-occurrence distributions by independently simulating traits using phylogenetically informed parameters, including, as a novel addition, the time to first event. The constrained variation in these simulations, combined with log odds ratio scoring for comparing across traits, enables robust identification of both positive and negative associations. Using synthetic datasets mimicking both gene-gene and gene-trait associations, we show that SimPhyNI achieves high precision and recall for both positive and negative interactions. We demonstrate SimPhyNI's utility by detecting interactions between phage defense systems in E. coli and gene-gene interactions across the entire E. coli pangenome (>9 million tests). Though developed here for binary traits, SimPhyNI's design supports extension to multi-state and continuous traits using generalized models of stochastic simulation. SimPhyNI's performance and scalability enable genome-wide discovery of genetic interactions that drive microbial function, ecology, and disease.
IMPACT STATEMENT: Understanding how bacterial genes associate with traits and with one another is essential for predicting disease outcomes, antibiotic resistance, and future evolution. However, identifying these interactions is challenging because shared ancestry creates false correlations. SimPhyNI overcomes this through an ancestry-informed statistical simulation process, achieving near-zero false positive rates while maintaining computational efficiency for large-scale analyses. This efficiency enables systematic mapping of gene-gene interaction networks across large datasets containing thousands of genes and genomes. As microbial genomic datasets continue to expand, SimPhyNI's scalability and precision will accelerate discovery of the mechanistic principles underlying infectious disease, microbiome function, and microbial evolution and ecology.
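The log odds ratio scoring mentioned in the abstract can be illustrated with a minimal sketch. It is not SimPhyNI's implementation: the function names, the 0.5 pseudocount, and the add-one empirical p-value are assumptions made here for illustration. The null scores are assumed to come from phylogenetically informed simulations of each trait, which this sketch does not reproduce.

```python
import math
from typing import Sequence, Tuple


def log_odds_ratio(x: Sequence[int], y: Sequence[int],
                   pseudocount: float = 0.5) -> float:
    """Log odds ratio of co-occurrence for two binary (0/1) traits.

    Each position is one genome. The pseudocount (an assumption of this
    sketch) keeps the ratio finite when a contingency cell is empty.
    """
    if len(x) != len(y):
        raise ValueError("trait vectors must have the same length")
    pairs = list(zip(x, y))
    n11 = sum(1 for a, b in pairs if a == 1 and b == 1)  # both present
    n10 = sum(1 for a, b in pairs if a == 1 and b == 0)  # only x present
    n01 = sum(1 for a, b in pairs if a == 0 and b == 1)  # only y present
    n00 = sum(1 for a, b in pairs if a == 0 and b == 0)  # both absent
    num = (n11 + pseudocount) * (n00 + pseudocount)
    den = (n10 + pseudocount) * (n01 + pseudocount)
    return math.log(num / den)


def empirical_p_values(observed: float,
                       null_scores: Sequence[float]) -> Tuple[float, float]:
    """Return (p_positive, p_negative) for an observed score.

    p_positive is the add-one-corrected fraction of null scores at least
    as large as the observed score; p_negative is the fraction at least
    as small. The null scores are supplied by the caller.
    """
    n = len(null_scores)
    at_least = sum(1 for s in null_scores if s >= observed)
    at_most = sum(1 for s in null_scores if s <= observed)
    return (at_least + 1) / (n + 1), (at_most + 1) / (n + 1)
```

A positive score indicates that the two traits co-occur more often than they occur apart, and a negative score indicates the opposite. Comparing the observed score with both tails of the null distribution is one way to test for positive and negative associations with the same statistic.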