Evaluating computational tools for protein-coding sequence detection: Are they up to the task?



Abstract

Detecting protein-coding genes in nucleotide sequences is a significant challenge for understanding genome and transcriptome function, yet the reliability of bioinformatic tools for this task remains largely unverified, even though some of these tools have been available for several decades and are widely used for genome and transcriptome annotation. We assess de novo protein-coding detection tools that operate on single nucleotide sequences or on alignments. The controls we use exclude any previous training data set and include coding exons as a positive set and length-matched intergenic and shuffled sequences as negative sets. Our work demonstrates that several widely used tools are neither accurate nor computationally efficient for the protein-coding sequence detection problem. In fact, just three of nine tools significantly outperformed a naive scoring scheme. Furthermore, we note a high discrepancy between self-reported accuracies and the accuracy achieved in our study. Our results show that the additional information from conserved and variable nucleotides in alignments gives alignment-based tools a significant advantage over single-sequence approaches. These results highlight significant limitations in existing protein-coding annotation tools that are widely used for lncRNA annotation, and they show a need for more robust and efficient approaches to training and assessing the performance of tools for identifying protein-coding sequences. Our study paves the way for future advancements in comparative genomic approaches, and we hope it will popularize more robust approaches to genome and transcriptome annotation.
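
To make the control design concrete, the following is a minimal sketch of how a shuffled, length-matched negative sequence and a naive baseline score might be computed. The abstract does not specify the naive scoring scheme or the shuffling procedure, so both are assumptions here: `shuffled_control` simply permutes nucleotides, and `longest_stop_free_run` is a hypothetical stand-in baseline that counts the longest run of stop-free codons across the three forward reading frames. The example sequence is invented for illustration.

```python
import random

# Standard stop codons (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def shuffled_control(seq, rng):
    """Return a length-matched negative control by permuting the nucleotides.

    Assumption: a plain mononucleotide shuffle; the study may use a
    different shuffling procedure.
    """
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def longest_stop_free_run(seq):
    """Hypothetical naive score: the longest run of consecutive codons
    without a stop codon, taken over the three forward reading frames."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        run = 0
        # Walk complete codons only, starting at this frame offset.
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                run = 0
            else:
                run += 1
                best = max(best, run)
    return best


# Invented toy "exon": a start codon followed by five GCT GCA repeats.
exon = "ATG" + "GCTGCA" * 5
rng = random.Random(0)  # fixed seed so the control is reproducible
control = shuffled_control(exon, rng)

exon_score = longest_stop_free_run(exon)
control_score = longest_stop_free_run(control)
```

Comparing the score distributions of a positive set and its matched negative sets, for example with a ROC curve, is one way a baseline of this kind could be set against the tools under evaluation.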
