The impact of bioinformatic choices on Coccidioides variant identification accuracy

Abstract

Emerging fungal pathogens such as Coccidioides, the causative agent of Valley fever (coccidioidomycosis), pose significant clinical and public health challenges. While advances in genomic epidemiology have enhanced our understanding of Coccidioides evolutionary history, effective variant identification is complicated by the genome's structural complexity. Repetitive elements, transposable sequences, and regions of low complexity can lead to incorrect variant calls, affecting downstream analyses. Further, accurate species identification is essential for understanding the spread of C. immitis and C. posadasii, which can co-occur despite having distinct primary geographic distributions. Distinguishing between these species is critical for interpreting patterns of transmission, emergence, and potential shifts in endemicity. To address both problems, we developed a pipeline to identify genetic variants and assign species directly from sequencing reads. We evaluated variant identification performance both across the whole genome and after excluding repetitive regions identified by NUCmer, a commonly used tool, on simulated genomic data and empirically generated sequence data. Whole-genome calling detected the highest number of single-nucleotide polymorphisms (SNPs), over 80,000 on average in both species, but included a substantial number of false positives, with 42,834 true positives and 38,115 false positives identified. Masking repetitive regions substantially improved accuracy. In C. immitis, masking with NUCmer increased sensitivity from 70.1% to 91.7% and precision from 52.7% to 91.1%. Similarly, in C. posadasii, sensitivity improved from 80.0% to 96.1% and precision from 53.1% to 90.4%. These improvements were also reflected in overall F1 scores, which rose from 60%-64% in whole-genome analysis to over 90% after masking. Using simulated reads, our pipeline recovered 83,400 SNPs in C. posadasii, with 40,163 shared across regions and a Jaccard index of 0.36. Species classification was highly accurate: 100% in simulations and 98.9% in 175 publicly available samples. Here, we provide a benchmarked variant and species identification pipeline for Coccidioides and quantify the impact of genomic region on variant identification performance, which may have downstream impacts on phylogenetic and genomic epidemiology inference.

IMPORTANCE

Accurate genetic analysis is essential for tracking and understanding emerging fungal pathogens like Coccidioides, the cause of Valley fever. However, the complex structure of fungal genomes makes it difficult to identify genetic differences reliably. This study demonstrates that the choice of genomic regions has a substantial impact on variant detection accuracy. We developed and tested a new tool called cocci-call and found that focusing on specific regions of the genome dramatically improves the accuracy of genetic variant detection. This improvement could enhance how researchers monitor outbreaks, track fungal evolution, and design better diagnostics. By identifying high-confidence regions for analysis, our work helps standardize how Coccidioides genomes are studied and compared, laying the groundwork for more accurate and reproducible genomic research in this important pathogen.
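The accuracy metrics quoted in the abstract (precision, F1 score, Jaccard index) follow standard definitions, and a short sketch may help readers check how the reported figures relate to one another. The function names below are hypothetical and are not taken from cocci-call; the only inputs are the counts and percentages stated in the abstract, and the SNP position sets in the Jaccard helper are an assumed representation.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Fraction of called variants that are real: TP / (TP + FP)."""
    return true_pos / (true_pos + false_pos)


def f1_score(prec: float, sens: float) -> float:
    """Harmonic mean of precision and sensitivity (recall)."""
    return 2 * prec * sens / (prec + sens)


def jaccard(calls_a, calls_b) -> float:
    """Overlap of two SNP call sets: |A & B| / |A | B|.

    The inputs are assumed to be iterables of hashable SNP identifiers,
    for example (contig, position) tuples.
    """
    a, b = set(calls_a), set(calls_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Whole-genome counts from the abstract: 42,834 TP and 38,115 FP.
print(round(precision(42_834, 38_115), 3))  # 0.529

# Whole-genome F1 from the reported precision and sensitivity.
print(round(f1_score(0.527, 0.701), 2))  # C. immitis: 0.6
print(round(f1_score(0.531, 0.800), 2))  # C. posadasii: 0.64
```

The pooled precision of about 53% and the F1 values of about 60% and 64% agree with the whole-genome figures reported above, which suggests the abstract's metrics use these conventional definitions.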
