Abstract
Transcript assembly remains challenging despite the development of numerous methods. A major contributor to low assembly accuracy is the difficulty of determining transcript start sites (TSSs) and transcript end sites (TESs), because the corresponding signals in RNA-seq data are typically weak and noisy. We present Telos, a two-stage machine learning framework for precise detection of TSSs and TESs and for transcript ranking. Telos takes as input any assembly, typically one generated by an existing assembler. In the first stage, Telos scores the TSSs and TESs in the input assembly using a machine learning model trained on a rich set of engineered features. These site-level scores are passed to the second stage for transcript-level evaluation. In the second stage, Telos scores each transcript with a second model that integrates features of its TSS and TES (including the probabilities inferred in the first stage), the transcript abundance estimated by the assembler, and exon-length statistics. We extensively evaluated Telos on ONT (cDNA and direct RNA), PacBio, and Illumina short-read RNA-seq datasets, and it outperformed the baseline methods in all cases. Telos is lightweight yet achieves substantial improvements, demonstrating the value of explicitly modeling TSSs and TESs, a gap in current transcript assembly tools. Telos can be paired with any assembler to accurately score the assembled transcripts. It is modular and easily extensible to emerging sequencing technologies, and we therefore anticipate its broad adoption in transcriptomic studies.
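
The two-stage design described above can be illustrated with a minimal, self-contained sketch. It is not the Telos implementation: the hand-written logistic regression stands in for whatever model Telos actually uses, the use of separate TSS and TES models is an assumption, and all feature names (`tss_feats`, `tes_feats`, `mean_exon_kb`), the toy data, and the labelling scheme are hypothetical. The sketch only shows how site-level probabilities from the first stage can become input features for a transcript-level model in the second.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X, y, lr=0.2, epochs=400):
    """Fit logistic regression by batch gradient descent; returns (weights, bias).

    Stand-in model (assumption): the abstract does not name the learner.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(model, x):
    """Probability that x belongs to the positive class."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Toy assembly (hypothetical): each transcript carries two made-up site features
# per end, an assembler abundance, and a mean exon length in kilobases.
random.seed(0)


def make_transcript(is_true):
    centre = 1.0 if is_true else -1.0
    return {
        "tss_feats": [random.gauss(centre, 0.3), random.gauss(centre, 0.3)],
        "tes_feats": [random.gauss(centre, 0.3), random.gauss(centre, 0.3)],
        "abundance": random.uniform(5, 20) if is_true else random.uniform(1, 10),
        "mean_exon_kb": random.uniform(0.1, 0.5),
        "label": 1 if is_true else 0,
    }


transcripts = [make_transcript(i % 2 == 0) for i in range(40)]
# Simplification: sites inherit the transcript label. A real pipeline would
# label each site independently, e.g. by comparison with a reference annotation.
labels = [t["label"] for t in transcripts]

# Stage 1: site-level models, one for TSSs and one for TESs (assumption).
tss_model = train_logistic([t["tss_feats"] for t in transcripts], labels)
tes_model = train_logistic([t["tes_feats"] for t in transcripts], labels)


def stage2_features(t):
    """Combine stage-1 probabilities with abundance and exon-length statistics."""
    return [
        predict_proba(tss_model, t["tss_feats"]),
        predict_proba(tes_model, t["tes_feats"]),
        math.log1p(t["abundance"]),
        t["mean_exon_kb"],
    ]


# Stage 2: transcript-level model trained on the combined features.
X2 = [stage2_features(t) for t in transcripts]
transcript_model = train_logistic(X2, labels)
scores = [predict_proba(transcript_model, x) for x in X2]

# Rank transcripts from most to least confident.
ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
```

For brevity the sketch scores the same toy transcripts it was trained on; any real evaluation would use held-out data, and the ranking in `ranked` is what a downstream user would threshold or filter.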