A Distributed Classifier for MicroRNA Target Prediction with Validation Through TCGA Expression Data

Abstract

BACKGROUND: MicroRNAs (miRNAs) are regulatory RNAs, approximately 22 nucleotides long, that mediate RNA interference by binding to cognate mRNA target regions. Here, we present a distributed kernel SVM-based binary classification scheme to predict miRNA targets. It captures the spatial profile of miRNA-mRNA interactions via smooth B-spline curves, fitted separately for each type of input feature, such as thermodynamic and sequence-based features. Further, we use a principled approach to uniformly model both canonical and non-canonical seed matches, using a novel seed enrichment metric. Finally, we verify our miRNA-mRNA pairings using an Elastic Net-based regression model on TCGA expression data for four cancer types to estimate the miRNAs that together regulate any given mRNA.

RESULTS: We present a suite of algorithms for miRNA target prediction, under the banner Avishkar, with superior prediction performance over competing methods. Specifically, our final kernel SVM model, with an Apache Spark backend, achieves an average true positive rate (TPR) of more than 75 percent at a false positive rate (FPR) of 20 percent for non-canonical human miRNA target sites. This is an improvement of over 150 percent in the TPR for non-canonical sites over the best-in-class algorithm. We achieve this performance by representing the thermodynamic and sequence profiles of miRNA-mRNA interaction as curves, devising a novel seed enrichment metric, and learning an ensemble of miRNA family-specific kernel SVM classifiers. We provide an easy-to-use system for large-scale interactive analysis and prediction of miRNA targets. All operations in our system, namely candidate set generation, feature generation and transformation, training, prediction, and computation of performance metrics, are fully distributed and scalable.

CONCLUSIONS: We have developed an efficient SVM-based model for miRNA target prediction using recent CLIP-seq data. It demonstrates superior performance, evaluated using ROC curves, across species (human or mouse) and target types (canonical or non-canonical). We analyzed the agreement between the target pairings obtained from CLIP-seq data and those obtained from expression data for four cancer types. To the best of our knowledge, we provide the first distributed framework for miRNA target prediction based on Apache Hadoop and Spark.

AVAILABILITY: All source code and sample data are publicly available at https://bitbucket.org/cellsandmachines/avishkar. Our scalable implementation of kernel SVM using Apache Spark, which can be used to solve large-scale non-linear binary classification problems, is available at https://bitbucket.org/cellsandmachines/kernelsvmspark.
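The headline metric above, TPR at a fixed FPR of 20 percent, is a single operating point read off a ROC curve. The following is a minimal sketch of how such a point can be computed from classifier scores. The function name `tpr_at_fpr` and the toy scores and labels are illustrative assumptions, not taken from the Avishkar code base or its data.

```python
from typing import Sequence


def tpr_at_fpr(scores: Sequence[float], labels: Sequence[int],
               max_fpr: float = 0.20) -> float:
    """Return the highest TPR reachable by a score threshold whose FPR
    does not exceed max_fpr. Labels are 1 (target) or 0 (non-target).
    Hypothetical helper for illustration only."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    # Sweep thresholds from the highest score downwards.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    best_tpr = 0.0
    i = 0
    while i < len(ranked):
        # Examples with tied scores fall on the same side of any threshold,
        # so they are counted together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        if fp / n_neg > max_fpr:
            break  # lowering the threshold further only raises the FPR
        best_tpr = tp / n_pos
        i = j
    return best_tpr


# Toy data (made up): 4 true targets and 6 non-targets.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]
print(tpr_at_fpr(toy_scores, toy_labels, max_fpr=0.20))  # 0.75
```

On the toy data, one false positive out of six negatives (FPR of about 0.17) is tolerated, and three of the four positives score above that threshold, giving a TPR of 0.75.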
