Abstract
Polyadenylation sites (poly(A) sites) play a key role in the post-transcriptional regulation of gene expression, and their accurate prediction is essential for identifying RNA processing defects associated with cancer and developmental disorders. Traditional approaches based on sequence motifs and experimental validation often struggle to generalize across cell types and species. To address this limitation, we investigate genome language models (GLMs) for poly(A) site prediction, leveraging their ability to capture long-range dependencies within genomic sequences. Specifically, we evaluate three state-of-the-art GLMs (DNABERT-2, Nucleotide Transformer, and HyenaDNA) under both few-shot classification and fine-tuning strategies. These models recognize canonical polyadenylation signals (PASs; e.g., AATAAA and its variants) and their spatial relationship to cleavage sites (10-30 bp), with HyenaDNA achieving an AUC of 0.751 in the few-shot setting and improved performance after fine-tuning. We further assess model interpretability through systematic signal perturbation experiments, which confirm the models' capacity to detect canonical PASs. Additionally, we propose a token-level classification approach for precise, position-wise poly(A) site identification across extended gene regions. Finally, we present PolyA-GLM, an end-to-end pipeline for discovering novel poly(A) sites, highlighting the potential of GLMs to reveal regulatory elements overlooked by conventional methods. Overall, this work demonstrates the promise of GLMs for advancing our understanding of RNA processing and regulatory element discovery.
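To make the PAS-to-cleavage-site relationship concrete, the following minimal sketch scans the 10-30 bp window upstream of a candidate cleavage position for PAS-like hexamers. It is an illustrative motif baseline of the kind the GLMs are compared against conceptually, not part of PolyA-GLM. The function name `find_upstream_pas`, the inclusion of ATTAAA as a variant, and the convention of measuring distance from the hexamer start to a 0-based cleavage position are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# Assumption: AATAAA is the canonical signal named in the text; ATTAAA is
# added here as one commonly cited variant, purely for illustration.
PAS_HEXAMERS: Tuple[str, ...] = ("AATAAA", "ATTAAA")


def find_upstream_pas(
    seq: str,
    cleavage_pos: int,
    hexamers: Sequence[str] = PAS_HEXAMERS,
    min_dist: int = 10,
    max_dist: int = 30,
) -> List[Tuple[str, int, int]]:
    """Return (hexamer, start, distance) for each PAS-like hexamer whose
    start lies min_dist..max_dist bp upstream of cleavage_pos.

    Assumption: positions are 0-based and distance is measured from the
    first base of the hexamer to the cleavage position.
    """
    seq = seq.upper()
    hits = []
    for dist in range(min_dist, max_dist + 1):
        start = cleavage_pos - dist
        if start < 0:
            break  # ran off the 5' end of the sequence
        window = seq[start:start + 6]
        if window in hexamers:
            hits.append((window, start, dist))
    return hits


# Toy sequence: a single AATAAA starting at position 20.
toy_seq = "C" * 20 + "AATAAA" + "C" * 14 + "G" * 10
print(find_upstream_pas(toy_seq, cleavage_pos=40))
```

A fixed-window scan like this only captures exact motif matches at a hard distance cutoff, which is one reason such rules may generalize poorly compared with learned sequence representations.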