Abstract
Relating billions of proteins across the tree of life remains a challenging task for comparative biosphere genomics and artificial intelligence-driven structure prediction. Here we present DIAMOND DeepClust, an ultra-fast cascaded clustering method that enables planetary-scale organization of protein space, scaling to trillions of sequences while retaining sensitivity at low sequence identity. Aggregating 19 billion biosphere proteins into 544 million nonsingleton clusters, we show that our DeepClust database, which is available for download, can enhance structure prediction with AlphaFold2.