Abstract
Identifying interactions between two or more proteins is crucial for understanding the cellular behaviour of living organisms and the molecular mechanisms underlying various diseases. However, most existing computational algorithms in the field model the problem as a binary interaction between any two proteins, rather than preserving information about the evolutionarily conserved regions involved in protein function and interactions. Preserving this information is important for predicting potential interaction sites, which is vital for drug design, target identification, and understanding disease progression and pathogenic mechanisms. Position-aware encoding incorporates the order of amino acids in a protein sequence into the model. Because sequence order affects the structure and function of proteins, such encoding can capture folding patterns and lead to more accurate predictions of protein structures and their interactions. The proposed DensePPI-2 model combines a novel bio-inspired, substitution matrix-based sequence encoding with deep learning to identify interacting protein pairs. It achieves an AUC of 97.13% on the S. cerevisiae dataset, an improvement of 1.4% over the best existing methods. Furthermore, DensePPI-2 outperforms recent sequence-based approaches on the human benchmark dataset, addressing the complexities of protein-protein interaction test classes. DensePPI-2 has also been successfully applied to (i) identifying pathogen-host interactions and (ii) predicting near-residue-level interactions, even though the model was not trained on residue-level data. The enhanced performance on diverse test sets demonstrates the effectiveness of the bio-inspired sequence-to-image colour encoding strategy based on substitution matrices. The dataset and the developed models are available at https://github.com/CMATERJU-BIOINFO/DensePPI-2 for academic use only.
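To make the idea of a position-aware, substitution matrix-based sequence-to-image encoding concrete, the following minimal sketch maps a protein pair to a grid of pixel intensities. It is an illustration under stated assumptions, not the DensePPI-2 implementation: the four-letter alphabet, the scores in `TOY_SCORES`, the function names `score` and `encode_pair`, and the single-channel min-max scaling are all hypothetical choices made for this example. Row i and column j of the grid correspond to residue i of the first sequence and residue j of the second, so the order of the amino acids is preserved in the image.

```python
# Hypothetical toy substitution scores over a 4-letter alphabet.
# These are illustrative values only, not a published BLOSUM or PAM matrix.
TOY_SCORES = {
    ("A", "A"): 4, ("A", "R"): -1, ("A", "N"): -2, ("A", "D"): -2,
    ("R", "R"): 5, ("R", "N"): 0, ("R", "D"): -2,
    ("N", "N"): 6, ("N", "D"): 1,
    ("D", "D"): 6,
}

SCORE_MIN = min(TOY_SCORES.values())
SCORE_MAX = max(TOY_SCORES.values())


def score(a: str, b: str) -> int:
    """Symmetric lookup of the substitution score for residues a and b."""
    if (a, b) in TOY_SCORES:
        return TOY_SCORES[(a, b)]
    # Raises KeyError for residues outside the toy alphabet.
    return TOY_SCORES[(b, a)]


def encode_pair(seq_a: str, seq_b: str) -> list[list[int]]:
    """Encode a protein pair as a len(seq_a) x len(seq_b) intensity grid.

    Pixel (i, j) holds the substitution score of seq_a[i] against seq_b[j],
    min-max scaled to the 0..255 range (assumed scaling, single channel).
    """
    span = SCORE_MAX - SCORE_MIN
    return [
        [round(255 * (score(a, b) - SCORE_MIN) / span) for b in seq_b]
        for a in seq_a
    ]


if __name__ == "__main__":
    for row in encode_pair("ARND", "NDA"):
        print(row)
```

A colour image could be obtained in the same way by stacking several such grids as channels, for example one per substitution matrix; that design choice is likewise an assumption of this sketch. The resulting grid is the kind of input a convolutional network could classify as interacting or non-interacting.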