Abstract
Deep generative models hold transformative potential for the design and optimization of antibodies and nanobodies in therapeutic and diagnostic applications, but progress is hindered by the fragmented and inconsistent nature of existing datasets. To address these limitations, we introduce the Antibody and Nanobody Design Dataset (ANDD), a unified dataset that integrates sequence, structure, antigen, and affinity data from 15 diverse sources. ANDD comprises 48,683 antibody/nanobody sequences, with structural data for 24,941 entries and antigen sequences for 12,575 entries. We further augmented the affinity data with 2,271 affinity values predicted using ANTIPASTI, a robust model for binding affinity prediction. As a result, ANDD includes 9,557 affinity values in total, making it the largest dataset to date for antibody/nanobody and antigen pairs with affinity data. By addressing the challenges of data fragmentation and inconsistency, ANDD provides a robust foundation for training deep generative models. Models trained on ANDD can better capture antibody/nanobody-antigen interactions and design novel antibodies and nanobodies with improved specificity and efficacy, paving the way for the development of targeted therapeutics.