Abstract
Within-host evolution plays a critical role in shaping the diversity of SARS-CoV-2. However, the primary factors that determine the prevalence of intra-host single nucleotide variants (iSNVs) in the viral population remain elusive. Here, we analyzed more than 556,000 SARS-CoV-2 sequencing datasets, together with prevalence data for SARS-CoV-2 S protein amino acid mutations, to identify the key factors influencing the prevalence of iSNVs in the SARS-CoV-2 S gene. Within-host diversity analysis revealed mutational hotspots within the S gene, located mainly in the NTD, RBD, TM, and CT domains. We also generated a selection status map of the S protein at single amino acid resolution. Within-host fitness varied significantly among iSNVs in the S protein. Most iSNVs exhibited low to no within-host fitness and low alternate allele frequency (AAF), suggesting that they are likely to be eliminated by the narrow transmission bottleneck of SARS-CoV-2. Notably, iSNVs with moderate AAFs (0.06-0.12) were more prevalent than those with high AAFs. Furthermore, iSNVs with the potential to alter antigenicity were more prevalent. These findings highlight within-host fitness and antigenicity shift as two key factors influencing the prevalence of iSNVs in the SARS-CoV-2 S gene.