Abstract
Designing proteins with multiple optimized properties remains a fundamental challenge in biotechnology, especially when design objectives exhibit trade-offs or when structural templates are unavailable. We present scoring-assisted generative exploration for proteins (SAGE-Prot), a modular and extensible protein design framework that integrates autoregressive sequence generation, genetic algorithm (GA)-based diversification, and scoring-guided property evaluation in a closed-loop optimization process. Unlike conventional approaches, SAGE-Prot performs optimization directly at the sequence level without relying on structural templates for generation, while still enabling structure-aware evaluation. Across rediscovery and similarity benchmarks involving 10 therapeutic proteins, hybrid language model/GA strategies implemented in SAGE-Prot consistently outperformed language model-only and heuristic baselines. Applied to two design problems (protein G domain B1 optimization for binding affinity and thermal stability, and TEM-1 β-lactamase optimization for enzymatic activity and solubility), SAGE-Prot effectively identified high-performing variants guided by predictive models trained on diverse sequence- and structure-derived descriptors. A curriculum learning (CL) strategy further accelerated convergence and improved design quality. Notably, experimental validation of six SAGE-Prot-designed TEM-1 β-lactamase variants confirmed up to a 752-fold increase in catalytic activity, underscoring the practical utility of this generative framework. These results highlight how coupling deep generative modeling with structure-informed evaluation and iterative fine-tuning enables generalizable, data-driven protein engineering across diverse optimization landscapes.
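
To make the closed-loop idea concrete, the following is a minimal, illustrative sketch of a generate, diversify, score, and refit cycle. It is not the SAGE-Prot implementation. All names (`score`, `fit_position_model`, `closed_loop`, and so on) are hypothetical; the per-position frequency table is a deliberately simple stand-in for an autoregressive language model; and the scorer is a toy identity-to-target objective loosely analogous to a rediscovery benchmark, not a trained property predictor.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def score(seq, target):
    """Toy scorer (assumption): fraction of positions identical to a target."""
    return sum(a == b for a, b in zip(seq, target)) / len(target)

def fit_position_model(seqs, pseudocount=1.0):
    """Stand-in for fine-tuning a generator: per-position residue counts."""
    model = []
    for i in range(len(seqs[0])):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for s in seqs:
            counts[s[i]] += 1.0
        model.append(counts)
    return model

def sample(model, rng):
    """Draw one sequence, position by position, from the count model."""
    return "".join(
        rng.choices(list(c.keys()), weights=list(c.values()))[0] for c in model
    )

def mutate(seq, rng, rate=0.05):
    """GA point mutation: resample each residue with probability `rate`."""
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < rate else aa for aa in seq
    )

def crossover(a, b, rng):
    """GA single-point crossover between two equal-length parents."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def closed_loop(target, rounds=20, n_gen=50, n_keep=20, seed=0):
    """Run the loop; return the best sequence and the best score per round."""
    rng = random.Random(seed)
    # Start from random sequences of the target's length.
    pool = ["".join(rng.choice(AMINO_ACIDS) for _ in target) for _ in range(n_keep)]
    history = []
    for _ in range(rounds):
        # 1) Refit the generator on the current top-scoring pool.
        model = fit_position_model(pool)
        # 2) Generate new candidates from the refitted generator.
        generated = [sample(model, rng) for _ in range(n_gen)]
        # 3) Diversify with crossover and mutation over pool + generated.
        parents = pool + generated
        diversified = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(n_gen)
        ]
        # 4) Score everything and keep the top n_keep. The previous pool is
        #    retained as candidates, so the best score never decreases.
        candidates = set(pool + generated + diversified)
        pool = sorted(candidates, key=lambda s: (-score(s, target), s))[:n_keep]
        history.append(score(pool[0], target))
    return pool[0], history
```

Calling `closed_loop("MKTAYIAKQR", rounds=10)` returns the best sequence found and a list of per-round best scores. In a realistic setting, the toy scorer would be replaced by one or more trained property predictors, and multi-objective trade-offs would require a combined or Pareto-based ranking rather than a single scalar.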