Abstract
Molecular representation plays a crucial role in computer-aided drug design. Most existing multimodal approaches simply concatenate different feature representations and pay little attention to how those features are integrated. To address this issue, this study proposes MulAFNet, a network framework that integrates multimodal representations using a multihead attention flow. MulAFNet takes three inputs: a SMILES string representation and molecular graph representations at two levels, the atom level and the functional-group level. A pretraining task is established for each of these three representations, and the pretrained representations are then fused in downstream tasks to predict molecular properties. Experiments on six classification data sets and three regression data sets show that using multiple molecular representations as input has a significant impact on the results. In particular, our fusion method outperforms other state-of-the-art methods in molecular property prediction. Comparative experiments on fusion methods and ablation studies further validate the effectiveness of MulAFNet. The results demonstrate that multiple molecular feature representations provide a more comprehensive understanding of molecules and that appropriate pretraining tasks enhance molecular property prediction.
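
To make the contrast between simple concatenation and attention-based fusion concrete, the following minimal Python sketch applies generic multihead self-attention to three modality embeddings (SMILES, atom-level graph, functional-group-level graph). It is an illustration under stated assumptions, not the MulAFNet implementation: the abstract does not specify the attention flow, so the projection matrices here are random rather than learned, the final mean pooling is an assumption, and names such as `multihead_fuse` are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Random projection standing in for a learned weight matrix (assumption)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def multihead_fuse(tokens, num_heads, rng):
    """Fuse modality embeddings with multihead self-attention.

    tokens: list of equal-length embeddings, one per modality.
    Returns (fused_vector, attention) where attention[h][i][j] is the weight
    that modality i places on modality j in head h.
    """
    dim = len(tokens[0])
    assert dim % num_heads == 0, "dim must be divisible by num_heads"
    head_dim = dim // num_heads

    Wq, Wk, Wv = (rand_matrix(dim, dim, rng) for _ in range(3))
    Q = [matvec(Wq, t) for t in tokens]
    K = [matvec(Wk, t) for t in tokens]
    V = [matvec(Wv, t) for t in tokens]

    outputs = [[] for _ in tokens]
    attention = []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        head_weights = []
        for i, q in enumerate(Q):
            # Scaled dot-product scores of modality i against every modality.
            scores = [
                sum(a * b for a, b in zip(q[lo:hi], k[lo:hi])) / math.sqrt(head_dim)
                for k in K
            ]
            weights = softmax(scores)
            head_weights.append(weights)
            # Weighted sum of the value slices for this head.
            context = [
                sum(weights[j] * V[j][lo + d] for j in range(len(tokens)))
                for d in range(head_dim)
            ]
            outputs[i].extend(context)
        attention.append(head_weights)

    # Mean-pool the attended modality tokens into one vector (assumption).
    fused = [sum(o[d] for o in outputs) / len(outputs) for d in range(dim)]
    return fused, attention


rng = random.Random(0)
dim = 8
# Placeholder embeddings; in practice these would come from pretrained encoders.
smiles_emb = [rng.gauss(0, 1) for _ in range(dim)]
atom_graph_emb = [rng.gauss(0, 1) for _ in range(dim)]
group_graph_emb = [rng.gauss(0, 1) for _ in range(dim)]

fused, attn = multihead_fuse(
    [smiles_emb, atom_graph_emb, group_graph_emb], num_heads=2, rng=rng
)
concat = smiles_emb + atom_graph_emb + group_graph_emb  # simple concatenation baseline

print(len(fused), len(concat))
```

Concatenation yields a vector three times the embedding size with no interaction between modalities, whereas the attention-based variant keeps the embedding size and lets each modality weight the others before pooling.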