Abstract
Accurate prediction of protein-protein interactions (PPIs) is fundamental to understanding biological processes and disease mechanisms. While deep learning offers a powerful alternative to costly experimental methods, existing approaches often overlook critical protein-surface information and rely on simplistic feature fusion techniques, thereby limiting performance. To address this, we introduce GSMFormer-PPI, a novel multimodal framework that integrates protein molecular surface features, 3D structural graphs, and residue-level sequence embeddings. Our architecture employs geometric deep learning (MaSIF) to extract physicochemical surface descriptors, graph convolutional networks to process structural context, and a transformer encoder with linear projectors to learn complex cross-modal interactions beyond simple concatenation. GSMFormer-PPI was evaluated on a curated PINDER dataset, and direct comparisons showed that it outperforms traditional graph-based models. Furthermore, a cross-dataset comparison revealed that it achieves performance comparable to or higher than that reported for other top models. Ablation studies confirm the critical contribution of surface features and our advanced fusion strategy to the model's superior predictive power. This work demonstrates that the integrative analysis of surface, structure, and sequence data is a vital and promising direction for advancing PPI prediction.