Abstract
SUMMARY: Accurately predicting protein-ligand interactions and binding affinities is essential for advancing structural biology. Despite recent advances in deep learning, rapid and precise prediction remains challenging. Our approach, the Protein-Ligand Cross-Modal Fusion Predictor (PLXFPred), extracts physicochemical properties from amino acid sequences and SMILES strings and leverages pre-trained models to derive high-dimensional features. GATv2 and BiLSTM process the structural and sequence features, respectively. At the model's core, sequence and graph features are fused via a cross-modal cross-attention mechanism, followed by a multi-modal hierarchical fusion strategy that integrates high-level graph, early-fusion, and cross-fusion features. Residual connections and conditional domain adversarial learning improve generalization to previously unseen protein-ligand pairs. Compared with state-of-the-art models, PLXFPred demonstrates superior performance, reducing errors (RMSD, MAE, SD) by over 50%, while providing interpretable biological insights through attention-weight visualization and SHAP analysis. AVAILABILITY AND IMPLEMENTATION: The source code is available at https://github.com/xiyuyangtuo/PLXFPred/.
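
To illustrate the cross-modal cross-attention step named above, the following is a minimal sketch of scaled dot-product cross-attention in plain Python, in which sequence-side features act as queries and graph-side features act as keys and values. It is not the PLXFPred implementation: the function names (`softmax`, `cross_attention`), the toy feature vectors, and the omission of learned projection matrices, multiple heads, and residual connections are all assumptions made for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (hypothetical helper).

    queries: feature vectors from one modality (here, sequence positions)
    keys, values: feature vectors from the other modality (here, graph nodes)
    Returns the fused vectors and the attention weights per query.
    """
    dim = len(keys[0])
    fused, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        w = softmax(scores)
        # Weighted sum of the value vectors, one output channel at a time.
        out = [
            sum(wj * v[c] for wj, v in zip(w, values))
            for c in range(len(values[0]))
        ]
        fused.append(out)
        weights.append(w)
    return fused, weights


# Toy inputs (assumed for illustration): 3 sequence positions, 2 graph nodes.
seq_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
graph_feats = [[1.0, 0.0], [0.0, 1.0]]

fused, attn = cross_attention(seq_feats, graph_feats, graph_feats)
```

Each row of `attn` sums to 1 and indicates how strongly a sequence position attends to each graph node; weights of this kind are what an attention-weight visualization would typically display.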