Abstract
Epidermal growth factor receptor (EGFR) overexpression is a key oncogenic driver in breast cancer, making EGFR an important therapeutic target. Conventional computational approaches for EGFR identification, including motif- and homology-based methods, often lack accuracy and sensitivity, while experimental assays such as immunohistochemistry are costly and give variable results. To address these limitations, we propose a deep learning-based predictor, ERCNN-EGFR, for the accurate identification of EGFR proteins directly from primary amino acid sequences. Protein features were extracted using composition distribution transition (CDT), amphiphilic pseudo amino acid composition (AmpPseAAC), the k-spaced conjoint triad descriptor (KSCTD), and ProtBERT-BFD embeddings. To reduce redundancy and enhance discriminative power, the features were refined using the XGBoost-Feature Forward Selection (XGBoost-FFS) approach. Multiple deep learning frameworks were evaluated, including Bidirectional Long Short-Term Memory (BiLSTM), Gated Recurrent Unit (GRU), Generative Adversarial Network (GAN), and Ensemble Residual Convolutional Neural Network (ERCNN). Among them, ERCNN performed best after feature selection, achieving 93.48% accuracy, 94.53% sensitivity, 92.58% specificity, and a Matthews correlation coefficient of 0.816, and it reached 82.85% accuracy on an independent test set. Ablation analysis confirmed that the dual residual building blocks and the ProtBERT-BFD features were critical to the model's predictive strength. ERCNN-EGFR offers a scalable, cost-effective, and accurate computational approach for EGFR identification, with potential applications in breast cancer diagnostics, therapeutic target discovery, and personalized treatment strategies.
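For readers less familiar with the four evaluation measures reported above, the following minimal sketch shows how accuracy, sensitivity, specificity, and the Matthews correlation coefficient are conventionally computed from a binary confusion matrix. The function name `binary_metrics` and the counts in the usage lines are hypothetical and purely illustrative; they are not the study's data or code.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fn: positive (e.g. EGFR) samples predicted correctly/incorrectly.
    tn/fp: negative samples predicted correctly/incorrectly.
    Assumes both classes are present (tp + fn > 0 and tn + fp > 0).
    """
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical counts chosen only to illustrate the arithmetic.
metrics = binary_metrics(tp=90, fn=10, tn=80, fp=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Unlike accuracy, the Matthews correlation coefficient uses all four cells of the confusion matrix, which is why it is commonly reported alongside sensitivity and specificity when class sizes may differ.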