Abstract
Allergy is an immune response triggered by specific peptides recognized by immune system effectors. While several bioinformatics tools have been developed to predict protein allergenicity, most rely on hand-selected features and lack interpretability. Improved predictive and explainable models are needed, especially for under-studied plant allergens. We present DeepPlantAllergy, a deep learning model that combines Convolutional Neural Networks (CNNs), Bidirectional Long Short-Term Memory (BiLSTM) networks, and Multi-Head Self-Attention (MHSA) to capture both local patterns and long-range dependencies within protein sequences. We evaluated four embedding techniques-including one-hot encoding, SeqVec, ProtBert, and ESM-1B-and employed Integrated gradients to identify residues contributing to allergenicity. Predictive performance was similar for ESM-1B and ProtBert embeddings, with no statistically significant difference, with an F1 score of 93.9% and 93.6% and AUC of 97.74% and 97.8%, respectively. Motif extraction revealed complementary strengths: ProtBert highlighted regions similar to OneHot patterns, while ESM captured distinct segments, and SeqVec identified additional regions overlapping with experimentally validated epitopes. Notably, molecular docking confirmed the biological plausibility of a predicted epitope, supporting the utility of residue-level predictions. DeepPlantAllergy thus offers both high predictive accuracy and interpretable insights, facilitating the discovery of allergenic motifs in under-characterized plant proteins. The source code, datasets used for training and evaluation, trained models, and the full pipeline for prediction and motif identification are available at the GitHub Repository: https://github.com/Lilly-dh/DeepPlantAllergy.