BACKGROUND: Orthogonal confirmation of variants identified by next-generation sequencing (NGS) is routinely performed in many clinical laboratories to improve assay specificity. However, confirmatory testing of all clinically significant variants increases both turnaround time and operating costs for laboratories. Refinements to early NGS methods and bioinformatics algorithms have dramatically improved variant calling accuracy, particularly for single nucleotide variants (SNVs), thus calling into question the necessity of confirmatory testing for all variant types. The purpose of this study is to develop a new machine learning approach to capture false positive heterozygous SNVs from whole exome sequencing (WES) data.
RESULTS: WES variant calls from Genome in a Bottle (GIAB) cell lines and their associated quality features were used to train five different machine learning models to predict whether a variant was a true positive or a false positive based on quality metrics. Logistic regression and random forest models exhibited the highest false positive capture rates among the selected models, but gradient boosting achieved the best balance between false positive capture rates and true positive flag rates. Further assessment using simulated false positive events as well as different combinations of quality features showed that model performance can be refined. Integration of the highest-performing models into a custom two-tiered confirmation bypass pipeline with additional guardrail metrics achieved 99.9% precision and 98% specificity in the identification of true positive heterozygous SNVs within the GIAB benchmark regions. Furthermore, testing on an independent set of heterozygous SNVs (n = 93) detected by exome sequencing of patient samples and cell lines demonstrated 100% accuracy.
CONCLUSIONS: Machine learning models can be trained to classify SNVs into high-confidence or low-confidence categories with high precision, thus reducing the level of confirmatory testing required. Laboratories interested in deploying such models should consider incorporating additional quality criteria and thresholds to serve as guardrails in the assessment process.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12864-025-11889-z.
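To make the reported metrics concrete, the following is a minimal Python sketch of one possible reading of a two-tiered bypass decision: a model score threshold first, then guardrail checks on quality metrics. It is not the authors' pipeline. The `Call` fields, the function names, and every threshold (`min_score`, `min_depth`, `af_range`) are hypothetical and chosen only for illustration, as are the example calls. Here a "positive" is a call allowed to bypass orthogonal confirmation, so precision is the fraction of bypassed calls that are true variants, and specificity is the fraction of false calls that are still sent to confirmation.

```python
from dataclasses import dataclass


@dataclass
class Call:
    score: float        # hypothetical model probability that the call is a true variant
    depth: int          # read depth at the site
    allele_frac: float  # variant allele fraction
    is_true: bool       # truth label from a benchmark set


def bypass_confirmation(call, min_score=0.99, min_depth=20, af_range=(0.3, 0.7)):
    """Return True if the call may skip orthogonal confirmation.

    All thresholds are illustrative assumptions, not values from the study.
    """
    # Tier 1: the model must rate the call as high confidence.
    if call.score < min_score:
        return False
    # Tier 2: guardrail metrics must also look like a clean heterozygous call.
    low, high = af_range
    return call.depth >= min_depth and low <= call.allele_frac <= high


def precision_specificity(calls):
    """Compute precision and specificity of the bypass decision."""
    tp = fp = tn = 0
    for call in calls:
        bypassed = bypass_confirmation(call)
        if bypassed and call.is_true:
            tp += 1   # true variant, confirmation skipped
        elif bypassed and not call.is_true:
            fp += 1   # false call that escaped confirmation
        elif not bypassed and not call.is_true:
            tn += 1   # false call correctly sent to confirmation
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return precision, specificity


# Made-up example calls, for illustration only.
example_calls = [
    Call(score=0.999, depth=40, allele_frac=0.50, is_true=True),
    Call(score=0.999, depth=40, allele_frac=0.48, is_true=True),
    Call(score=0.995, depth=10, allele_frac=0.50, is_true=False),  # fails depth guardrail
    Call(score=0.500, depth=40, allele_frac=0.20, is_true=False),  # fails model tier
    Call(score=0.999, depth=35, allele_frac=0.45, is_true=False),  # passes both tiers
]

precision, specificity = precision_specificity(example_calls)
```

In this toy data the third call is caught only by the depth guardrail, which illustrates why the conclusions recommend additional quality criteria alongside the model score.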
Determination of high-confidence germline genetic variants in next-generation sequencing through machine learning models: an approach to reduce the burden of orthogonal confirmation.
Authors: Yan Muqing, Zeng Qiandong, Zhang Zhenxi, Okamoto Patricia, Letovsky Stanley, Kenyon Angela, Leach Natalia, Reiner Jennifer
Journal: | BMC Genomics | Impact factor: | 3.700 |
Year: | 2025 | Volume(issue):pages: | 2025 Aug 6; 26(1):728 |
doi: | 10.1186/s12864-025-11889-z |