Abstract
A promoter is an important non-coding DNA sequence that regulates gene expression. Promoter abnormalities are closely associated with various diseases, such as coronary heart disease, diabetes, and tumors, so accurate promoter identification is highly significant. Because they extract nonlinear features insufficiently and capture sequence context relationships poorly, existing single promoter identification models show limited classification performance. To overcome these shortcomings, this paper proposes a new model called iPro2L-Kresidual. iPro2L-Kresidual integrates a residual structure with a KAN network to form a novel Kresidual module, which significantly enhances the nonlinear expression capability of sequence features by combining B-spline functions with residual connections. To fully capture sequence context relationships, iPro2L-Kresidual also improves the Transformer encoder module by replacing its linear processing method with gated recurrent units, so that it can extract both local and global context features of a sequence. Furthermore, iPro2L-Kresidual uses a regularized label smoothing cross-entropy loss function to keep training stable and to prevent the model from becoming overconfident. In 5-fold cross-validation, the model achieved accuracies of 94.28% for promoter identification and 90.55% for promoter strength identification. On an independent dataset, the prediction accuracy reached 93.13%, further demonstrating the model's strong generalization ability. iPro2L-Kresidual thus provides a novel and effective predictive model for promoter site prediction.
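To illustrate how label smoothing keeps a classifier from becoming overconfident, the following minimal sketch implements the standard label smoothing cross-entropy for a single sample. It is not the paper's loss: the function names, the default `epsilon=0.1`, and the uniform smoothing distribution are assumptions, and the additional regularization term mentioned above is not specified in the abstract and is therefore not reproduced.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw class scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def label_smoothing_cross_entropy(logits, target, epsilon=0.1):
    """Cross-entropy against a smoothed target distribution.

    The one-hot target is mixed with a uniform distribution over the
    k classes: q = (1 - epsilon) * one_hot + epsilon / k.
    epsilon = 0 recovers the ordinary cross-entropy.
    (epsilon and the uniform prior are illustrative assumptions.)
    """
    k = len(logits)
    log_probs = log_softmax(logits)
    smoothed = [epsilon / k] * k
    smoothed[target] += 1.0 - epsilon
    return -sum(q * lp for q, lp in zip(smoothed, log_probs))


# Hypothetical two-class example: promoter (class 0) vs. non-promoter (class 1).
confident_logits = [4.0, -4.0]
plain_loss = label_smoothing_cross_entropy(confident_logits, target=0, epsilon=0.0)
smoothed_loss = label_smoothing_cross_entropy(confident_logits, target=0, epsilon=0.1)
```

With smoothing enabled, a very confident correct prediction still incurs a noticeably larger loss than under plain cross-entropy, which discourages the logits from growing without bound during training.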