Abstract
One of the major challenges in natural product discovery is the prioritization of compounds with useful activities from microbial sources. Here, we incorporate a machine learning model that predicts the antibacterial activity of a natural product from its biosynthetic gene cluster (BGC) into our genome mining pipeline. Using this approach, we prioritized Amycolatopsis azurea DSM 43854 as a candidate strain encoding multiple BGCs with the potential to produce antibacterial compounds. Through bioactivity-guided fractionation, dipyrimicins A and B were isolated and, for the first time, linked to their BGC. This dip BGC was predicted by our model to encode a product with 76% antibacterial probability and shares only 40-52% similarity with previously characterized BGCs. The antimicrobial properties of the dipyrimicins were confirmed against a few test strains, and putative tailoring enzymes were identified, including an O-methyltransferase and an amidotransferase that differentiate the dip pathway from other related 2,2'-bipyridine biosynthetic pathways. Importantly, because the dip BGC was not in the model's training set, this result demonstrates the model's ability to generalize beyond its training data and the potential of machine learning to accelerate the discovery of novel bioactive natural products and the deorphanization of BGCs.