PolyA-GLM: A comprehensive framework for de novo polyadenylation site prediction using genome language models



Abstract

Polyadenylation sites (poly(A) sites) play a key role in the post-transcriptional regulation of gene expression, and predicting them accurately is essential for identifying RNA processing defects associated with cancer and developmental disorders. Traditional approaches based on sequence motifs and experimental validation often fail to generalize across cell types and species. To address this limitation, we investigate genome language models (GLMs) for poly(A) site prediction, leveraging their ability to capture long-range dependencies within genomic sequences. Specifically, we evaluate three state-of-the-art GLMs (DNABERT-2, Nucleotide Transformer, and HyenaDNA) under both few-shot classification and fine-tuning strategies. These models recognize canonical polyadenylation signals (PASs), such as AATAAA and its variants, together with their spatial relationship to cleavage sites (10-30 bp); HyenaDNA achieves an AUC of 0.751 in the few-shot setting and performs better after fine-tuning. We further examine model interpretability through systematic signal perturbation experiments, which confirm that the models detect canonical PASs. Additionally, we propose a token-level classification approach for precise, position-wise identification of poly(A) sites across extended gene regions. Finally, we present PolyA-GLM, an end-to-end pipeline for discovering novel poly(A) sites, highlighting the potential of GLMs to reveal regulatory elements overlooked by conventional methods. Overall, this work demonstrates the promise of GLMs for advancing our understanding of RNA processing and regulatory element discovery.
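The signal-to-cleavage-site relationship and the perturbation idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the PolyA-GLM pipeline and does not involve any language model. The function names (`find_pas_upstream`, `perturb_pas`), the motif list (AATAAA plus the common ATTAAA variant), the convention of measuring distance from the 3' end of the hexamer to the cleavage position, and the choice of filler base are all assumptions made for illustration.

```python
# Hypothetical helper names; none of this comes from the PolyA-GLM code base.
CANONICAL_PAS = ("AATAAA", "ATTAAA")  # assumption: canonical hexamer plus one common variant


def find_pas_upstream(seq, cleavage_pos, min_dist=10, max_dist=30, motifs=CANONICAL_PAS):
    """Return (motif, start, distance) for every motif whose 3' end lies
    between min_dist and max_dist bases upstream of cleavage_pos.

    cleavage_pos is a 0-based index into seq; distance is measured from the
    end of the motif to the cleavage position (an assumed convention).
    """
    seq = seq.upper()
    hits = []
    for motif in motifs:
        k = len(motif)
        first_start = max(0, cleavage_pos - max_dist - k)  # farthest allowed start
        last_start = cleavage_pos - min_dist - k            # nearest allowed start
        for start in range(first_start, last_start + 1):
            if seq[start:start + k] == motif:
                hits.append((motif, start, cleavage_pos - (start + k)))
    return sorted(hits, key=lambda hit: hit[1])


def perturb_pas(seq, hits, filler="C"):
    """Overwrite each detected signal with a filler base, keeping the length.

    This mimics a signal perturbation: a model scoring the original and the
    perturbed sequence should respond differently if it relies on the signal.
    """
    chars = list(seq.upper())
    for motif, start, _ in hits:
        chars[start:start + len(motif)] = filler * len(motif)
    return "".join(chars)


# Toy sequence: one AATAAA whose 3' end sits 15 bases upstream of position 41.
demo_seq = "G" * 20 + "AATAAA" + "C" * 15 + "G" * 10
demo_hits = find_pas_upstream(demo_seq, cleavage_pos=41)
demo_perturbed = perturb_pas(demo_seq, demo_hits)
```

In a perturbation experiment of the kind the abstract describes, `demo_seq` and `demo_perturbed` would both be scored by the model under study, and a drop in the predicted poly(A) probability for the perturbed sequence would suggest that the model depends on the signal.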
