Abstract
Formulation development of protein biopharmaceuticals has become increasingly challenging due to new modalities and higher target drug substance concentrations. The limited amount of drug substance available during development, coupled with extensive analytical requirements, restricts the number of excipients that can be empirically screened. There is therefore a strong need for in silico tools to optimize excipient pre-selection before wet lab experiments. Here, we introduce Excipient Prediction Software (ExPreSo), a supervised machine learning algorithm that suggests excipients based on the properties of the protein drug substance and the target product profile. ExPreSo was trained on a dataset comprising 335 regulatory-approved peptide and protein drug products. Predictive features included protein structural properties, protein language model embeddings, and drug product characteristics. ExPreSo showed good performance, with minimal overfitting, for the nine most prevalent excipients in biopharmaceutical formulations. A fast variant of ExPreSo using only sequence-based input features showed predictive power similar to that of slower variants that relied on molecular modeling. Notably, an ExPreSo variant using only protein-based input features also showed good performance, indicating resilience to the influence of platform formulations. To our knowledge, this is the first machine learning algorithm to suggest biopharmaceutical excipients based on a dataset of regulatory-approved drug products. Overall, ExPreSo shows great potential to reduce the time, costs, and risks associated with excipient screening during formulation development.