Abstract
Accurately identifying the genes responsible for specific functions is a cornerstone of biological research, but current methods are often limited to single-species analyses. Here, we present a novel method, Genomic and Phenotype-based machine learning for Gene Identification (GPGI), that leverages large-scale, cross-species genomic and phenotypic data for functional gene discovery. Using bacterial rod-shape determination as a case study, we demonstrate GPGI's ability to rapidly identify key genes. Our approach uses machine learning to predict bacterial shape from protein structural domain profiles and identifies the most influential domains; the genes encoding these domains are then selected for experimental validation. Focused gene knockouts in Escherichia coli confirmed the critical roles of two genes, pal and mreB, in maintaining rod-shaped morphology. We further confirmed GPGI's robustness: its performance remained consistent even with reduced datasets. GPGI thus offers a rapid, accurate, and efficient way to identify multiple key genes associated with complex traits across diverse organisms.