Asymmetric trichotomous partitioning overcomes dataset limitations in building machine learning models for predicting siRNA efficacy

Authors: Kathryn R Monopoli, Dmitry Korkin, Anastasia Khvorova

Abstract

Chemically modified small interfering RNAs (siRNAs) are promising therapeutics that guide sequence-specific silencing of disease genes. Identifying chemically modified siRNA sequences that effectively silence target genes remains challenging and necessitates computational algorithms. Machine learning is a powerful predictive approach for tackling biological problems but typically requires datasets significantly larger than most available siRNA datasets. Here, we describe a framework applying machine learning to a small dataset (356 modified sequences) for siRNA efficacy prediction. To overcome noise and biological limitations in siRNA datasets, we apply a trichotomous (two-threshold) partitioning approach, producing several combinations of classification threshold pairs. We then test the effects of different thresholds on random forest model performance using a novel evaluation metric that accounts for class imbalances. We identify thresholds yielding a model with high predictive power that outperforms a linear model generated from the same data and that was predictive upon experimental evaluation. Using a novel model feature extraction method, we observe target site base importances and base preferences consistent with our current understanding of the siRNA-mediated silencing mechanism, with the random forest providing higher resolution than the linear model. This framework applies to any classification challenge involving small biological datasets, providing an opportunity to develop high-performing design algorithms for oligonucleotide therapies.
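The two-threshold partitioning described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes efficacy is recorded as percent target expression remaining (lower means more effective), and the function names, the label encoding, and the threshold values 25 and 60 are hypothetical choices made only for illustration.

```python
from itertools import combinations
from typing import List, Optional, Sequence, Tuple


def trichotomous_partition(
    remaining: Sequence[float], low: float, high: float
) -> List[Optional[int]]:
    """Assign each value to one of three classes using two thresholds.

    Assumption: values are percent target expression remaining.
    1    = effective   (value <= low)
    0    = ineffective (value >= high)
    None = undefined middle class (low < value < high)
    """
    if low >= high:
        raise ValueError("low threshold must be below high threshold")
    labels: List[Optional[int]] = []
    for value in remaining:
        if value <= low:
            labels.append(1)
        elif value >= high:
            labels.append(0)
        else:
            labels.append(None)
    return labels


def threshold_pairs(candidates: Sequence[float]) -> List[Tuple[float, float]]:
    """Enumerate every (low, high) pair with low < high from candidate cutoffs."""
    return list(combinations(sorted(set(candidates)), 2))


# Hypothetical measurements, for illustration only.
example = [5.0, 22.0, 40.0, 61.0, 95.0, 30.0]
labels = trichotomous_partition(example, low=25.0, high=60.0)
pairs = threshold_pairs([60, 25, 40])
```

In a workflow of this kind, each threshold pair would yield a differently labeled dataset, and a classifier such as a random forest could be trained and evaluated on each one to compare thresholds. How the middle class is handled during training and evaluation is not stated in the abstract, so the sketch only marks those sequences as `None`.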
