Abstract
Peptide aggregation is a long-standing challenge in chemical peptide synthesis, limiting its efficiency and reliability. Although data-driven methods have enhanced our understanding of many sequence-based phenomena, no comprehensive approach addresses so-called non-random difficult couplings (generally linked to aggregation) during solid-phase peptide synthesis. Here we leverage existing peptide synthesis datasets, supplemented with further experimental data, to build a predictive model that deciphers the role of individual amino acids in triggering aggregation. We first identified and experimentally validated that amino acid composition is a stronger predictor of aggregation than sequence-based patterns. This insight enabled the development of a composition vector representation, which reveals the aggregation propensities of individual amino acids. Applying an ensemble of trained models, we predicted the aggregation properties of peptides and recommended where aggregation-reducing tools are best applied. By elucidating the influence of each amino acid, this method has the potential to accelerate synthesis optimization using existing data, offering a robust framework for understanding and controlling peptide aggregation.
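
To illustrate what a composition vector representation can look like, the following is a minimal sketch that assumes the vector holds the fraction of each of the 20 canonical amino acids in a peptide, independent of residue order. The function name `composition_vector` and the constant `AMINO_ACIDS` are hypothetical and are not taken from this work; the representation used here may differ, for example in how it treats non-canonical residues or whether it uses raw counts.

```python
from collections import Counter

# Assumed alphabet: the 20 canonical amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(sequence: str) -> list[float]:
    """Return the fraction of each canonical amino acid in `sequence`.

    The result has one entry per letter of AMINO_ACIDS and sums to 1.
    Residue order is deliberately discarded.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    unknown = set(seq) - set(AMINO_ACIDS)
    if unknown:
        raise ValueError(f"unsupported residues: {sorted(unknown)}")
    counts = Counter(seq)
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


# Two peptides with the same residues in a different order
# map to the same composition vector.
vec_a = composition_vector("GAVVLK")
vec_b = composition_vector("KLVVAG")
```

Because the vector ignores order, any model trained on it can only attribute aggregation to which residues are present and in what proportion, which is what makes per-residue propensities directly readable from such a representation.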