PlantLncBoost: key features for plant lncRNA identification and significant improvement in accuracy and generalization

PlantLncBoost:植物lncRNA鉴定的关键特性以及准确性和泛化能力的显著提升

阅读:1

Abstract

Long noncoding RNAs (lncRNAs) are critical regulators of numerous biological processes in plants. Nevertheless, their identification is challenging due to the low sequence conservation across various species. Existing computational methods for lncRNA identification often face difficulties in generalizing across diverse plant species, highlighting the need for more robust and versatile identification models. Here, we present PlantLncBoost, a novel computational tool designed to improve the generalization in plant lncRNA identification. By integrating advanced gradient boosting algorithms with comprehensive feature selection, our approach achieves both high accuracy and generalizability. We conducted an extensive analysis of 1662 features and identified three key features - ORF coverage, complex Fourier average, and atomic Fourier amplitude - that effectively distinguish lncRNAs from mRNAs. We assessed the performance of PlantLncBoost using comprehensive datasets from 20 plant species. The model exhibited exceptional performance, with an accuracy of 96.63%, a sensitivity of 98.42%, and a specificity of 94.93%, significantly outperforming existing tools. Further analysis revealed that the features we selected effectively capture the differences between lncRNAs and mRNAs across a variety of plant species. PlantLncBoost represents a significant advancement in plant lncRNA identification. It is freely accessible on GitHub (https://github.com/xuechantian/PlantLncBoost) and has been integrated into a comprehensive analysis pipeline, Plant-LncRNA-pipeline v.2 (https://github.com/xuechantian/Plant-LncRNA-pipeline-v2).

特别声明

1、本页面内容包含部分的内容是基于公开信息的合理引用;引用内容仅为补充信息,不代表本站立场。

2、若认为本页面引用内容涉及侵权,请及时与本站联系,我们将第一时间处理。

3、其他媒体/个人如需使用本页面原创内容,需注明“来源:[生知库]”并获得授权;使用引用内容的,需自行联系原作者获得许可。

4、投稿及合作请联系:info@biocloudy.com。