Abstract
Antibodies safeguard human health through their precise and potent binding to specific antigens, and have demonstrated promising therapeutic efficacy against numerous diseases, including COVID-19. Recent advances in biomedical language models have shown great potential for interpreting complex biological structures and functions. However, existing antibody-specific models share a notable limitation: they lack explicit consideration of antibody structural information, even though the 1-dimensional sequence and the 3-dimensional structure each carry unique and complementary insights into antibody behavior and functionality. This paper proposes the Sequence-Structure multi-level pre-trained Antibody Language Model (S²ALM), which combines holistic sequential and structural information in one unified, generic antibody foundation model. We construct a hierarchical pre-training paradigm with two customized multi-level training objectives to facilitate the modeling of comprehensive antibody representations. S²ALM's representation space uncovers inherent functional binding mechanisms, biological evolution properties, and structural interaction patterns. Pre-trained on over 75 million sequences and 11.7 million structures, S²ALM can be adopted for diverse downstream tasks: accurately predicting antigen-antibody binding affinities, precisely distinguishing B-cell maturation stages, identifying antibodies' crucial binding positions, and designing novel coronavirus-binding antibodies. Remarkably, S²ALM outperforms well-established, renowned baselines, setting new state-of-the-art performance across extensive antibody-specific understanding and generation tasks. Its ability to model comprehensive and generalized representations further positions S²ALM to advance real-world therapeutic antibody development, potentially addressing unmet academic, industrial, and clinical needs.