Abstract
A challenge in population ecology is determining how best to group individuals into populations, especially when individual origin is unknown. Machine learning (ML) has improved upon traditional methods of identifying population structure and handles large, complex datasets more efficiently. We demonstrate the applicability of an ML method for identifying hierarchical population structure in an emerging pathogen, Coccidioides spp., the causative agent of Valley fever. To validate the network's performance, we compared the network clusters with the structure identified by traditional tools. We used publicly available whole-genome data for 48 C. immitis and 102 C. posadasii, resulting in 168,211 genome-wide SNPs across the two species. The network analysis grouped samples into populations comparable to those reported in the literature for these species, and it also identified fine-scale geographic structure and travel-associated cases not previously reported. Exploring different resolutions in the network made it easy to identify unique genotypes specific to California and possibly Nevada, as well as Phoenix- and Tucson-acquired infections in non-endemic areas, regardless of reported travel history. The present study provides a promising example of how an ML-based network analysis can improve our ability to understand pathogen ecology, group cases into populations, and infer travel-associated infections.