Abstract
Despite the long-held view of Mycobacterium tuberculosis (Mtb) as a genetically conserved pathogen, many genomic regions remain poorly resolved due to high sequence homology and repetitive content. Using complete genome assemblies generated from long-read sequencing of 151 globally representative clinical isolates, we analyzed genome-wide patterns of genetic diversity and evolution in Mtb. Our analysis uncovers pronounced diversity hotspots within paralogous regions generated by recurrent gene conversion between homologous genes. In many cases, these hotspots exhibit more than an order of magnitude greater genetic diversity than the rest of the Mtb genome, which is otherwise characterized by remarkably low variation. Mutations within these regions display clustered substitution patterns, an excess of paralog-matching variants, and distinct mutational spectra, all consistent with ongoing gene conversion. We identify over 300 individual gene conversion events distributed throughout the Mtb phylogeny. These events occur predominantly within gene families associated with virulence and host-pathogen interactions, including the PE, PPE, and ESX families. Several of the most pronounced diversity hotspots occur in antigens encoded within paralogous regions. Among these, the vaccine candidate PPE18 harbors mutations within validated epitope sequences, with predicted alterations in HLA-II binding. Together, these findings demonstrate that gene conversion actively shapes antigenic and virulence-associated diversity in Mtb.