Abstract
N6-methyladenosine (m(6)A) plays a crucial regulatory role in the control of cellular functions and gene expression. Recent advances in sequencing techniques for transcriptome-wide m(6)A mapping have accelerated the accumulation of m(6)A site information at single-nucleotide resolution, providing more high-confidence training data for developing computational approaches to m(6)A site prediction. However, precisely predicting m(6)A sites with in silico approaches remains a major challenge. To advance computational support for m(6)A site identification, we curated 13 up-to-date benchmark datasets from nine species (i.e., H. sapiens, M. musculus, rat, S. cerevisiae, zebrafish, A. thaliana, pig, rhesus, and chimpanzee). These datasets will assist the research community in conducting unbiased evaluations of alternative approaches and support future research on m(6)A modification. We revisited 52 computational approaches for m(6)A site identification published since 2015, including 30 traditional machine learning-based, 14 deep learning-based, and 8 ensemble learning-based methods. We comprehensively reviewed these approaches in terms of their training datasets, calculated features, computational methodologies, performance evaluation strategies, and webserver/software usability. Using the curated benchmark datasets, we evaluated nine predictors that are available as web servers or stand-alone software and assessed their prediction performance. We found that deep learning and traditional machine learning approaches generally outperformed scoring function-based approaches. In summary, the curated benchmark dataset repository and the systematic assessment in this study serve to inform the design and implementation of state-of-the-art computational approaches for m(6)A identification and to facilitate more rigorous comparisons of new methods in the future.