Abstract
Drug-drug interactions (DDIs) can cause severe adverse reactions, so accurate prediction of DDI events is crucial for ensuring the safety of combination therapies and for supporting drug development. Although deep learning-based approaches have made promising progress, existing models remain limited in modeling local-global dependencies, integrating multimodal information, and capturing cross-level molecular relationships. To address these challenges, we propose MHAFR-DDI (Multi-modal Hierarchical Attention Fusion and Relation-aware Architecture for DDI Event Prediction), a framework that unifies modeling from intra-molecular representation to inter-molecular interaction. MHAFR-DDI adopts a two-stage pretraining-finetuning paradigm. In the pretraining stage, the model learns complementary representations from molecular sequences, 2D topological structures, and 3D spatial conformations, using modality-specific encoding mechanisms designed to capture both local structural characteristics and global semantic dependencies. Localized chemical primitives within each modality are first stabilized and then integrated into higher-level representations to ensure intra-modality stability and representational completeness. Attention-guided data augmentation and multi-level contrastive learning then impose alignment constraints across modalities and their augmented views, achieving cross-modal semantic consistency and alleviating data sparsity. In the finetuning stage, the pretrained molecular representations are hierarchically fused and propagated over the drug-drug interaction graph, enabling interaction-aware information sharing among drugs and improving prediction reliability for rare drugs and long-tail interaction types.
Experiments on benchmarks with 65 and 86 DDI types show that MHAFR-DDI outperforms state-of-the-art methods under the standard split, achieving macro-F1 gains of 9.5% and 6.7%, respectively, while remaining robust in weakly supervised long-tail and cold-start settings.