Abstract
INTRODUCTION: The precise identification of 5-methylcytosine (m5C), an epitranscriptomic modification fundamental to RNA function, is crucial yet difficult to achieve experimentally. Computational prediction offers a promising alternative, but improving its accuracy and robustness remains an open objective. To address these limitations, this study introduces a deep learning framework for accurate m5C site prediction from RNA sequences. METHODS: We propose FusDRM-m5C, a deep learning framework with a multi-branch architecture that processes three distinct feature types: one-hot vector representation (one-hot), Z-curve-based geometrical features (Z-curve), and local RNA secondary structure (RSS). Each feature type is processed by a separate, parallel branch. Within each branch, a Dilated Convolutional Neural Network (DCNN) captures multi-scale patterns, followed by a Multi-Head Self-Attention (MHSA) mechanism with residual connections that weighs context-dependent information. The high-level representations from the three branches are then fused by concatenation. The fused feature vector is fed into a final fully connected network, which outputs the predicted probability that a site is an m5C site. RESULTS: The performance of FusDRM-m5C was evaluated using both 5-fold cross-validation (CV) and independent dataset testing. In 5-fold CV on the benchmark dataset, the model achieved a Sensitivity (Sn) of 0.995, Specificity (Sp) of 0.971, Accuracy (ACC) of 0.983, Matthews correlation coefficient (MCC) of 0.966, and an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.997. On an independent test dataset, the model maintained strong generalization capability, attaining an Sn of 0.900, Sp of 0.965, ACC of 0.933, MCC of 0.867, and an AUC of 0.986.
Furthermore, we assessed the cross-species prediction performance of FusDRM-m5C. The model maintained high accuracy and robustness across datasets from multiple species, outperforming several existing state-of-the-art methods. DISCUSSION: The proposed FusDRM-m5C model demonstrates highly accurate and robust prediction of m5C sites, comparing favorably with existing methods. Its architecture processes diverse biological features in separate branches, each combining dilated convolution with self-attention, and fuses the resulting representations by concatenation, offering a powerful tool for m5C identification.
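
For readers who want to see how the threshold-based metrics reported above (Sn, Sp, ACC, MCC) relate to one another, the following minimal sketch computes them from a confusion matrix using their standard definitions. The function name `classification_metrics` and the example counts (1,000 positive and 1,000 negative test sites) are hypothetical and chosen only for illustration; the abstract does not state the dataset sizes, and this is not the authors' evaluation code.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Return Sn, Sp, ACC and MCC from confusion-matrix counts.

    Standard textbook definitions; illustrative helper, not the
    authors' evaluation code.
    """
    def ratio(num, den):
        # Return 0.0 when a denominator is zero (a common convention).
        return num / den if den else 0.0

    sn = ratio(tp, tp + fn)                    # sensitivity (recall)
    sp = ratio(tn, tn + fp)                    # specificity
    acc = ratio(tp + tn, tp + fn + tn + fp)    # overall accuracy
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ratio(tp * tn - fp * fn, mcc_den)    # Matthews correlation
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc}


# Hypothetical balanced test set (assumption: 1,000 sites per class),
# with counts chosen to give Sn = 0.900 and Sp = 0.965.
m = classification_metrics(tp=900, fn=100, tn=965, fp=35)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Under this balanced-class assumption, Sn = 0.900 and Sp = 0.965 imply an ACC of about 0.9325 and an MCC of about 0.867, in line with the independent-test figures quoted above. AUC is not covered here because it depends on the full ranking of predicted probabilities rather than on a single confusion matrix.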