Abstract
BACKGROUND: Reliable identification of protein-protein interactions (PPIs) is crucial for deciphering cellular functional networks. Existing prediction models remain limited in two respects: aligning heterogeneous features, and learning on graphs from sparse supervision signals. To address these issues, this study proposes DSS-PPI, a prediction framework that integrates multimodal sequence semantics with self-supervised graph learning, thereby transforming static protein sequence embeddings into dynamic, topology-aware representations.

RESULTS: DSS-PPI employs a dual-stream architecture that combines ProTrek's cross-modal aligned embeddings with ProtT5's deep sequence features. Its context encoder uses Smith-Waterman sequence similarity as a quantitative edge feature to guide graph attention weights, and incorporates Deep Graph Infomax (DGI) for self-supervised pretraining. A gated fusion mechanism then lets the model adaptively integrate sequence semantics with network topological information. On both human and multi-species benchmark datasets, the model performs competitively with existing state-of-the-art algorithms, reaching an accuracy of 0.73 on the rigorously designed Bernett test set.

CONCLUSIONS: This study demonstrates the synergistic effect of multimodal embeddings and self-supervised graph learning in PPI prediction. Ablation experiments and SHAP interpretability analysis further indicate that DSS-PPI captures genuine physical interaction patterns. The framework provides a reliable computational tool for understanding complex biological networks and holds broad potential for biomedical applications.

SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12864-026-12762-3.
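
ILLUSTRATIVE SKETCH: The gated fusion step named in the RESULTS can be pictured with a minimal example. The abstract does not give the exact formulation, so the form below is an assumption: a sigmoid gate computed from the concatenated sequence and graph embeddings, used to blend the two element-wise. The names `gated_fusion`, `h_seq`, `h_graph`, `W`, and `b` are hypothetical and do not come from the DSS-PPI code.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(h_seq, h_graph, W, b):
    """Blend a sequence embedding and a graph embedding for one protein.

    Assumed form (not taken from the paper):
        gate  = sigmoid(W @ [h_seq ; h_graph] + b)
        fused = gate * h_seq + (1 - gate) * h_graph

    h_seq, h_graph: lists of d floats
    W: d rows, each a list of 2*d floats
    b: list of d floats
    """
    x = h_seq + h_graph  # list concatenation: [h_seq ; h_graph]
    fused = []
    for i in range(len(h_seq)):
        gate = sigmoid(sum(w * v for w, v in zip(W[i], x)) + b[i])
        fused.append(gate * h_seq[i] + (1.0 - gate) * h_graph[i])
    return fused


# Toy inputs: with all-zero gate parameters every gate equals 0.5.
h_seq = [1.0, 0.0, -2.0]
h_graph = [0.0, 4.0, 2.0]
d = len(h_seq)
W_zero = [[0.0] * (2 * d) for _ in range(d)]
b_zero = [0.0] * d
fused = gated_fusion(h_seq, h_graph, W_zero, b_zero)
```

With zero gate parameters the output is the element-wise average of the two embeddings; a large positive bias pushes each gate toward 1, so the output approaches the sequence embedding. In a trained model `W` and `b` would be learned, which is what would make the weighting adaptive per feature.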