Boosting Transcript Assembly via Delineating Transcript Start and End Sites

Abstract

Transcript assembly remains a challenging task despite the development of numerous methods. A major contributor to low assembly accuracy is the difficulty of determining transcript start sites (TSSs) and transcript end sites (TESs), because the relevant signals in RNA-seq data are typically weak and noisy. We present Telos, a two-stage machine learning framework for precise detection of TSSs and TESs and for transcript ranking. The method takes as input any assembly, typically one generated by an existing assembler. In the first stage, Telos scores the TSSs and TESs in the input assembly using a machine learning model trained on a rich set of engineered features. These site-level scores are passed to the second stage for transcript-level evaluation. In the second stage, Telos scores entire transcripts by training another model that integrates features of each transcript's TSS and TES (including the probabilities inferred in the first stage) with the transcript abundance estimated by the assembler and statistics on exon lengths. We extensively evaluated Telos on ONT (cDNA and direct RNA), PacBio, and Illumina short-read RNA-seq datasets. In all cases, it outperformed the baseline methods. Telos is lightweight, yet it achieves substantial improvements, demonstrating the value of explicitly modeling TSSs and TESs, a gap in current transcript assembly tools. Telos can be paired with any assembler to score the assembled transcripts accurately. It is modular and easily extensible to emerging sequencing technologies, and we therefore anticipate its broad adoption in transcriptomic studies.
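
The two-stage design described in the abstract can be illustrated with a minimal sketch. It is not the Telos implementation. The abstract does not name the model family or the engineered features, so the logistic regression, the two site features, the toy training data, and all function and field names below are assumptions made for illustration. As a further simplification, one site model is shared between TSSs and TESs.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(model, x):
    """Probability of the positive class for one feature vector."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Stage 1: site-level model. Hypothetical features per candidate site:
# [fraction of reads starting or ending at the site, normalized coverage change].
# Label 1 means the site matches a reference annotation (toy data).
site_X = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.7], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]]
site_y = [1, 1, 1, 0, 0, 0]
site_model = fit_logistic(site_X, site_y)

def transcript_features(t):
    """Stage-2 features: stage-1 TSS/TES probabilities, abundance, exon-length statistic."""
    return [predict(site_model, t["tss"]), predict(site_model, t["tes"]),
            t["abundance"], t["mean_exon_len"]]

# Stage 2: transcript-level model trained on labeled transcripts (toy data, scaled to [0, 1]).
train = [
    {"tss": [0.9, 0.8], "tes": [0.7, 0.7], "abundance": 0.8, "mean_exon_len": 0.5, "label": 1},
    {"tss": [0.8, 0.9], "tes": [0.9, 0.8], "abundance": 0.6, "mean_exon_len": 0.4, "label": 1},
    {"tss": [0.7, 0.7], "tes": [0.8, 0.9], "abundance": 0.7, "mean_exon_len": 0.6, "label": 1},
    {"tss": [0.2, 0.1], "tes": [0.3, 0.2], "abundance": 0.2, "mean_exon_len": 0.5, "label": 0},
    {"tss": [0.1, 0.3], "tes": [0.2, 0.1], "abundance": 0.1, "mean_exon_len": 0.6, "label": 0},
    {"tss": [0.3, 0.2], "tes": [0.1, 0.3], "abundance": 0.3, "mean_exon_len": 0.4, "label": 0},
]
transcript_model = fit_logistic([transcript_features(t) for t in train],
                                [t["label"] for t in train])

# Rank assembled transcripts (hypothetical candidates) by their stage-2 score.
candidates = {
    "tx_strong":   {"tss": [0.85, 0.85], "tes": [0.8, 0.8], "abundance": 0.7, "mean_exon_len": 0.5},
    "tx_weak_tss": {"tss": [0.2, 0.2], "tes": [0.8, 0.8], "abundance": 0.7, "mean_exon_len": 0.5},
    "tx_weak":     {"tss": [0.15, 0.2], "tes": [0.2, 0.15], "abundance": 0.2, "mean_exon_len": 0.5},
}
ranked = sorted(((name, predict(transcript_model, transcript_features(t)))
                 for name, t in candidates.items()),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}\t{score:.3f}")
```

In this sketch the candidate `tx_weak_tss` differs from `tx_strong` only in its TSS evidence, so any difference in their scores comes from the stage-1 TSS probability. This is the kind of signal the abstract argues current assemblers leave unmodeled.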
